# Modular
> Deploy fast and scalable GenAI inference
This file contains all documentation content in a single document following the llmstxt.org standard.
## Attention mask
An attention mask is a mechanism used in the [attention](attention.mdx) layers
of a [transformer](transformer.mdx) model to indicate which tokens the model
should ignore when computing attention scores.
For example, attention masks can prevent the model from attending to [padding
tokens](padding-tokens.mdx), which are added to make sequences in a batch the
same length and thus offer no information for attention.
Another common mask is a "causal mask" (or "look-ahead mask"), which prevents
the [self-attention](self-attention.mdx) layer from looking at future tokens when
predicting a new token, ensuring that it attends only to previous tokens in the
sequence. Although it may sound absurd that the model would even try to look at
future tokens (because it generates tokens one at a time, in order), the
self-attention mechanism is designed for more general-purpose attention scoring. In its
most basic form, self-attention is agnostic to token order—it looks at all
tokens in the sequence equally, based on their embeddings, and calculates
scores by looking both backward and ahead in the sequence. (For example,
self-attention is used during [context encoding](context-encoding.mdx) to
establish an understanding of the input text.) So instead of creating a
different kind of attention mechanism for autoregressive inference, the causal
mask instructs the self-attention layer to simply ignore all future tokens and
only look backward when generating scores that help predict the next token.
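As a minimal sketch, a causal mask can be built as an additive matrix of `-inf` values above the diagonal and applied to the attention scores before the softmax (a NumPy illustration of the concept only, not any particular framework's API):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Additive causal mask: 0 where attention is allowed (the current
    token and earlier ones), -inf where future tokens must be ignored."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

# Applying the mask before softmax drives the weights for all
# future tokens to exactly zero.
scores = np.ones((4, 4))                  # dummy attention scores
masked = scores + causal_mask(4)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
# Row 0 can attend only to token 0, so all its weight lands there.
```

Because `exp(-inf)` is 0, the masked positions contribute nothing to the softmax, which is exactly what "ignore all future tokens" means in practice.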
---
## Attention
A mechanism used in AI models such as [transformers](transformer.mdx) that
enables the model to selectively focus on different parts of the input sequence
when making predictions.
Unlike traditional model architectures that process all input data with equal
importance, models with attention assign different importance levels to
different tokens (such as words or pixels). This allows the model to better
understand the complete meaning of the input, especially when an accurate
meaning depends on relationships between tokens that are far apart (such as
between words that occur far apart in a sentence).
Attention is crucial for large language models (LLMs) so they can capture
long-range dependencies and contextual relationships in the given text. It
allows LLMs to handle complex and nuanced language, enabling them to generate
coherent and contextually relevant output even when the input text includes
nuanced references to other parts of the text.
Attention was introduced and refined in the papers [Neural Machine Translation
by Jointly Learning to Align and Translate](https://arxiv.org/abs/1409.0473)
(Bahdanau et al., 2014) and [Effective Approaches to Attention-based Neural
Machine Translation](https://arxiv.org/abs/1508.04025) (Luong et al., 2015).
The most well-known form of attention is [self-attention](self-attention.mdx),
in which each token gets its own attention score for every other token (each
token "attends to" all other tokens) to determine the relative importance of
every other token in that context.
## Implementation details
The classic attention operation consists of the following operations (`bmm` is
short for batched matrix multiplication):
* `bmm`: `Q x Transpose(K)`
where `Q`, `K` both have shape `[batchSize, numHeads, S, d]`
and `Q x K^t` has the shape `[batchSize, numHeads, S, S]`
* `softmax`
* `bmm`: `softmax(Q x K^t) x V`
where V has the shape `[batchSize, numHeads, S, d]`
`S` denotes the sequence length. Depending on the model, it can be as large as
`O(10^3)` to `O(10^4)`. `d` is the size per head in multi-head attention. It's
usually a power of 2, such as 64 or 128, and smaller than `S`.
A limitation of the classic implementation is that it materializes an
intermediate matrix of shape `[batchSize, numHeads, S, S]`. This introduces
`O(S^2)` memory allocation and traffic.
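The operations above can be sketched in NumPy (with the customary `1/sqrt(d)` scaling added before the softmax). Note how the intermediate `weights` tensor has shape `[batchSize, numHeads, S, S]`, which is the `O(S^2)` allocation in question:

```python
import numpy as np

def classic_attention(Q, K, V):
    """Naive batched attention: materializes the full [B, H, S, S]
    score matrix, which is where the O(S^2) memory cost comes from."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d)     # [B, H, S, S]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax
    return weights @ V                                    # [B, H, S, d]

B, H, S, d = 2, 4, 8, 16
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((B, H, S, d)) for _ in range(3))
out = classic_attention(Q, K, V)                          # [B, H, S, d]
```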
---
## Autoregression
Autoregression is a process by which an AI model iteratively predicts future
values based on previous values in a sequence, using its own output as input to
itself. Because each prediction depends on prior context, the process is
sequential, which limits parallelization.
Autoregression is a standard procedure in [transformer](transformer.mdx) models
such as large language models (LLMs) and other models that perform time-series
forecasting. This autoregressive process explains why AI chatbots like ChatGPT
and Gemini stream their output one word at a time—although they sometimes run so
fast that they appear to produce more than one word at a time.
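The loop itself can be sketched in a few lines (the `toy_next_token` function is a hypothetical stand-in for a model's forward pass; a real LLM would return the most likely next token from its vocabulary):

```python
def toy_next_token(tokens):
    """Hypothetical stand-in for a model forward pass: 'predicts' the
    next value in a numeric sequence."""
    return tokens[-1] + 1

def generate(prompt, num_new_tokens):
    """Autoregressive generation: each prediction is appended to the
    sequence and fed back as input for the next prediction, so the
    steps must run sequentially."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(toy_next_token(tokens))   # output becomes input
    return tokens

generate([1, 2, 3], num_new_tokens=3)           # -> [1, 2, 3, 4, 5, 6]
```

The data dependency is visible in the loop body: step *n* cannot begin until step *n - 1* has produced its token, which is why this phase resists parallelization.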
---
## Batching
Batching is the process of combining multiple inference requests into a single
forward pass through the model, thus executing multiple requests simultaneously
and improving computational efficiency. To account for requests with varying
sequence lengths, it's common to add techniques such as
[padding](padding-tokens.mdx) (to standardize lengths) or [ragged
tensors](ragged-tensors.mdx) (to handle variable lengths directly).
Batch sizes can be either static or dynamic. Whereas static batching uses a
fixed batch size and thus waits until the system receives a specific number of
inference requests before sending them into the model, dynamic batching uses a
flexible batch size. For example, dynamic batching may send a batch of requests
to the model as soon as the batch either reaches a certain number of requests
(the batch size limit) or hits a timeout threshold.
Dynamic batching can get a lot more complicated than that with additional
tricks that keep GPUs busy instead of waiting for one batch to finish before
starting another. One such strategy for large language models (LLMs) is
[continuous batching](continuous-batching.mdx).
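The "full batch or timeout, whichever comes first" dispatch rule can be sketched as follows (timestamped pairs stand in for a live request queue; a real server would interleave this with model execution):

```python
def batch_requests(arrivals, max_batch_size, timeout_s):
    """Group timestamped requests into batches: dispatch when a batch
    is full (batch size limit) or when the oldest request in it has
    waited past the timeout."""
    batches, current, start = [], [], None
    for t, req in arrivals:
        # Timeout check: flush the pending batch before accepting a
        # request that arrives after the wait window has expired.
        if current and t - start >= timeout_s:
            batches.append(current)
            current, start = [], None
        if start is None:
            start = t
        current.append(req)
        if len(current) >= max_batch_size:      # batch size limit reached
            batches.append(current)
            current, start = [], None
    if current:                                 # flush any remainder
        batches.append(current)
    return batches

batch_requests([(0.0, "a"), (0.1, "b"), (5.0, "c")],
               max_batch_size=4, timeout_s=1.0)  # -> [["a", "b"], ["c"]]
```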
---
## Context encoding
Context encoding (also known as "prefill") is the first phase in a [transformer
model](transformer.mdx) that converts input data into a cached numerical
representation ([KV cache](kv-cache.mdx)) and predicts the first token. It
occurs after the input has already been [tokenized](tokenization.mdx)
(preprocessed).
Context encoding is then followed by the [autoregressive](autoregression.mdx)
token generation phase, which produces one token at a time. If it weren't for
the KV cache built during context encoding, the model would have to recalculate
the [self-attention](self-attention.mdx) score for each token in the original
input, every time it starts to predict a new token.
Context encoding is usually the most computationally expensive phase in an LLM,
because it must calculate attention scores for every token in the input
sequence. Although this process may be parallelized across thousands of GPU
threads (because each token can be processed separately), it is still a
significant latency factor for time-to-first-token (TTFT). The model can
usually produce subsequent tokens much faster than the first one because each
round of token generation needs to calculate an attention score for only one
token (the new one).
---
## Continuous batching
Continuous batching is a [batching](batching.mdx) technique that can
continuously dispatch inference requests to the GPU for [token
generation](token-generation.mdx) and dramatically improve GPU utilization.
Continuous batching can start executing a new batch even before the previous
batch finishes its pass through the model, because this batching technique
schedules new processing at the "token level."
That is, because large language models (LLMs) generate responses one token at a
time, there is a repeated cycle during inference (the token generation phase)
in which a new batch can jump in to utilize the GPU, even before a previous
batch finishes its pass through the model. That's what it means to operate at
the "token level"—the batch scheduler focuses on keeping the GPU busy with
token generation at all times, instead of waiting for the previous batch to
finish its complete forward pass.
This is sometimes called "in-flight batching" in cases where context
encoding and token generation requests are combined into the same batch.
---
## Embedding
An embedding (also known as a "vector embedding") is a numerical representation
of information in a high-dimensional vector space. For example, a token
embedding (or word embedding) encodes the meaning of words for use in large
language models (LLMs).
Because artificial neural networks (AI models) are a sequence of mathematical
operations, they require numerical structures as input. Vector embeddings are
numerical structures that provide a way to express a wide range of complex
concepts. They can be used to capture information about all sorts of things,
including words, groups of words, sounds, images, and more.
For example, [tokenizing](tokenization.mdx) a word like "bank" into a simple
number can't encode the different meanings in "bank loan" and "river bank." By
converting the token into a high-dimensional vector, we can encode (or "embed")
a variety of word meanings that help the model understand word relationships
using a notion of closeness along various vector dimensions (expressed through
[Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance)). In
this way, when a model encounters the embedding for the word "bank," it can
recognize the relationship it has with nearby words such as "loan" or "river,"
based on the closeness they each have to each other on different vector
dimensions (perhaps a "finance" dimension vs a "geography" dimension that were
learned during training).
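A toy illustration of that idea, with invented two-dimensional embeddings (real embeddings have hundreds or thousands of dimensions whose meanings are learned during training, not hand-assigned):

```python
import math

# Hypothetical 2-D embeddings along (finance, geography) dimensions,
# invented purely for illustration.
embeddings = {
    "bank":  [0.7, 0.6],
    "loan":  [0.9, 0.1],
    "river": [0.1, 0.9],
    "piano": [0.0, 0.0],
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# "bank" sits close to both "loan" (finance) and "river" (geography),
# but far from the unrelated "piano".
nearest = min((w for w in embeddings if w != "bank"),
              key=lambda w: euclidean(embeddings["bank"], embeddings[w]))
```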
Although word embeddings are a type of static embedding that encode the meaning
of individual words as input to an LLM, an LLM also builds its own embeddings
that are hidden inside the model. For example, as an LLM tries to understand
the relationship between each word from an input sequence, it compresses more
information into each token embedding based on the attention scores computed in
the [self-attention layer](self-attention.mdx).
:::note Embedding models
Whereas the token embeddings described above use a vector space to represent
the meaning of individual tokens, the output from an embedding model uses a
vector space to represent the meaning of the input data (a document) as a
whole. In this way, an embedding model allows you to programmatically search
and compare different documents by analyzing their corresponding embeddings,
which can reveal nuanced meaning and semantics far beyond what a pure text
comparison can achieve.
:::
---
## Flash attention
Flash attention is an optimization technique to compute attention blocks in
[transformer](transformer.mdx) models. Traditional [attention](attention.mdx)
requires storing large intermediate activation tensors, leading to high memory
overhead that slows execution because it requires frequent memory transfers
between high-bandwidth memory (HBM) and faster SRAM on the GPU.
Flash attention improves performance and reduces the memory footprint for
attention layers. It reorders computations with techniques such as tiling to
compute attention scores in blocks, and it keeps only small chunks of
activations in the faster on-chip SRAM. This allows the model to process much
longer sequences without running into memory limitations.
By improving the efficiency of attention layers, flash attention enables LLMs
to handle much longer contexts, improving their ability to understand and
generate complex text. It's particularly beneficial for:
* Large language models with long context windows
* Vision transformers processing high-resolution images
* Multi-modal models with large attention matrices
* Fine-tuning large models on limited GPU memory
## Implementation details
Flash attention optimizes the classic [attention](attention.mdx) mechanism by:
1. **Tiling the computation**: Breaking the `Q`, `K`, and `V` matrices into
smaller blocks that fit in GPU shared memory, which is much faster than
global memory.
2. **Fusing operations**: Combining softmax normalization with matrix
multiplication for each tile into a single kernel.
These help maximize the locality and reduce DRAM (global memory) traffic.
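The core trick can be sketched in NumPy: process the keys and values tile by tile, maintaining a running maximum and running softmax denominator (an "online softmax") so the full `[S, S]` score matrix never exists. This is a single-head sketch of the algorithm's structure only, not the fused GPU kernel:

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    """Single-head attention computed block by block with a running
    ("online") softmax, so the full [S, S] score matrix is never
    materialized -- only [block, block] tiles."""
    S, d = Q.shape
    out = np.empty_like(Q)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    for i in range(0, S, block):                  # loop over query tiles
        q = Q[i:i + block] * inv_sqrt_d
        m = np.full(q.shape[0], -np.inf)          # running row maximum
        l = np.zeros(q.shape[0])                  # running softmax denominator
        acc = np.zeros((q.shape[0], d))           # running weighted sum of V
        for j in range(0, S, block):              # loop over key/value tiles
            s = q @ K[j:j + block].T              # one [block, block] tile
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            corr = np.exp(m - m_new)              # rescale earlier partials
            l = l * corr + p.sum(axis=-1)
            acc = acc * corr[:, None] + p @ V[j:j + block]
            m = m_new
        out[i:i + block] = acc / l[:, None]
    return out
```

The result is numerically identical to classic attention; only the order of computation changes, trading the `O(S^2)` intermediate for small tiles that fit in fast on-chip memory.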
To see an implementation of
[FlashAttention-2](https://arxiv.org/abs/2307.08691) as a fused operation, see
[`fused_attention.mojo` on
GitHub](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/fused_attention.mojo).
---
## KV cache
KV (key-value) cache is a memory structure used in
[transformer](transformer.mdx) models to store key-value tensors output from
[self-attention](self-attention.mdx) layers. The KV cache speeds up inference
for transformer models such as large language models (LLMs) by avoiding the
need to recompute the self-attention scores for all previous tokens in a
sequence.
For example, suppose an LLM is trying to complete the sentence, "The quick
brown fox..." After the model predicts "jumps" and then begins to predict the
next token, the model must know the attention score for every token in the
sequence so far (including the one it just predicted). That is, for each step
in the [autoregression](autoregression.mdx) cycle, it must process the entire
sequence thus far:
1. "The quick brown fox..."
2. "The quick brown fox jumps..."
3. "The quick brown fox jumps over..."
And so on.
By storing the already-computed keys and values for previous tokens in the KV
cache, the model simply reads from the cache at each step, instead of
recomputing those tensors all over again. Once the model predicts the next
token, it computes that token's keys and values and adds them to the KV cache.
As the sequence length grows during inference (as more words are generated),
the KV cache becomes the dominant factor in a model's memory usage. The
sequence length is always limited by the model's total context window length,
which varies across models and can usually be configured.
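The bookkeeping amounts to an append-only store, sketched below (the `project` function is a hypothetical stand-in for the model's learned key/value projections):

```python
class KVCache:
    """Minimal sketch: stores the key and value tensors already computed
    for each token, so earlier positions are never reprocessed."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

# Hypothetical per-token projection; a real model computes keys and
# values with learned weight matrices.
def project(token):
    return ("k", token), ("v", token)

cache = KVCache()
for token in ["The", "quick", "brown", "fox"]:
    k, v = project(token)     # only the newest token is processed
    cache.append(k, v)        # its keys/values join the cached history
```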
---
## Padding tokens
Padding tokens are extra tokens (usually zeros or special tokens) that are
added to the input for a model so that the input matches the model's fixed
input length or to ensure that all sequences in a [batch](batching.mdx) have
the same length.
In [transformer](transformer.mdx) models, padding tokens have been mostly
replaced with [ragged tensors](ragged-tensors.mdx).
---
## PagedAttention
PagedAttention is a memory management technique designed to improve GPU memory
utilization during large language model (LLM) serving. Inspired by classical
virtual memory and paging methods used in operating systems, PagedAttention
divides the [KV cache](kv-cache.mdx) into fixed-size blocks, which are not
necessarily stored contiguously in memory. This approach enables more efficient
handling of dynamic states in LLMs, allowing the model to manage large context
sizes while optimizing memory usage, as described in the 2023 paper [Efficient
Memory Management for Large Language Model Serving with
PagedAttention](https://arxiv.org/abs/2309.06180) (Kwon et al., 2023).
Also written as "paged attention."
---
## Prefill
Prefill is the first phase of an AI model's forward pass in which the model
processes the input and initializes a cache to accelerate predictions.
Different model architectures may have their own version of a prefill, but it's
primarily associated with large language models (LLMs), in which case it's also
called [context encoding](context-encoding.mdx).
---
## Ragged tensors
Ragged tensors are a method for batching multiple requests with differing
sequence lengths without the need for [padding tokens](padding-tokens.mdx).
Ragged tensors allow sequences of variable lengths to be processed together
efficiently by storing them in a compact, non-uniform format.
Also sometimes referred to as "packed tensors."
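One common ragged layout stores all tokens in a single flat buffer plus per-sequence offsets (a representation chosen here for illustration; frameworks differ in the details):

```python
def pack(sequences):
    """Pack variable-length sequences into one flat buffer plus row
    offsets -- no padding tokens required."""
    values, offsets = [], [0]
    for seq in sequences:
        values.extend(seq)
        offsets.append(len(values))
    return values, offsets

def unpack_row(values, offsets, i):
    """Recover sequence i by slicing between its offsets."""
    return values[offsets[i]:offsets[i + 1]]

values, offsets = pack([[1, 2, 3], [4], [5, 6]])
# values  -> [1, 2, 3, 4, 5, 6]
# offsets -> [0, 3, 4, 6]
```

Compared to padding, no memory or compute is spent on filler tokens; the offsets tell the kernel where each sequence begins and ends.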
---
## Self-attention
Self-attention is a mechanism in a [transformer](transformer.mdx) model that
calculates the importance of different tokens (such as words) in a sequence,
relative to each other. Each token is said to "attend to" all other tokens in
the sequence by assigning an "attention score" to each one.
In a large language model (LLM), self-attention allows the model to build an
understanding of the whole text by evaluating how each word is relevant to all
other words in the text, no matter how far they are from each other.
The attention scores are computed using query, key, and value (QKV) vectors
that pertain to each token:
- The **query** is a vector that expresses what information a token is
*looking for* among all the other tokens (like a search query).
- The **key** is a vector that describes the information a token *offers* to
other tokens (like an answer to a token's query).
- The **value** is a vector that provides the **contextually relevant
  information** about this token.
After calculating attention scores by comparing the **query** and **key**
vectors between tokens, self-attention uses the scores to apply weighted
information from each token's **value** into a new [embedding](embedding.mdx)
for each token. Thus, self-attention outputs a new token embedding for each
token that carries information about its relationship with the other tokens in
the sequence.
The model also saves the calculated keys and values into the [KV
cache](kv-cache.mdx) to avoid redundant recompute for the same tokens during
the next [autoregression](autoregression.mdx) cycle.
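The steps above can be sketched for a single head in NumPy (random matrices stand in for a model's learned projection weights; real models use multiple heads, and the customary `1/sqrt(d)` scaling is included here):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """One self-attention head: each token's embedding in X is projected
    into a query, key, and value; queries are compared against keys; and
    the resulting weights mix the values into new token embeddings."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])     # query-key comparison
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)          # attention weights per token
    return w @ V                                # one new embedding per token

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))                 # 5 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
Y = self_attention(X, Wq, Wk, Wv)               # 5 updated token embeddings
```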
---
## Tokenization
Tokenization is the process of dividing the input for an AI model into discrete
units that have numerical IDs called tokens. Depending on what the input is
(such as text, audio, or an image) the tokens might be based on different words
or subwords in text, or different slices/blocks of pixels in images.
For example, consider the sentence, "The cat sat on the mat." A word-level
tokenization might split this sentence into the following words: "The," "cat,"
"sat," "on," "the," "mat." Then it replaces each word with a token (a number).
The token "vocabulary"—the mapping of words to numbers—is predetermined and may
vary from model to model.
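A word-level tokenizer along those lines, with a tiny hypothetical vocabulary (real vocabularies are learned and far larger):

```python
# Hypothetical toy vocabulary mapping words to token IDs.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def word_tokenize(text):
    """Word-level tokenization: split on whitespace, strip punctuation,
    lowercase, then map each word to its numeric token ID."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    return [vocab[w] for w in words]

word_tokenize("The cat sat on the mat.")  # -> [0, 1, 2, 3, 0, 4]
```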
But tokenizers in large language models (LLMs) are much more sophisticated than
that. Among other things, they also tokenize punctuation (or combinations of
words and punctuation) and break words into subwords, which allows them to
tokenize words they've never seen before.
Because LLMs are trained on these tokens, they don't actually understand words
and letters the way we do. They can only recognize and generate information
based on the token vocabulary that they were trained upon. (Popular LLMs have a
token vocabulary with over 100,000 tokens.)
---
## Transformer
A transformer is a neural network architecture designed to perform complex
tasks with sequential data (such as text, speech, and images) in a manner that
can be efficiently parallelized on GPUs or other accelerator hardware. This
makes them highly effective for natural language processing and other
generative AI (GenAI) applications.
The transformer model architecture was first introduced in the paper [Attention
Is All You Need](https://arxiv.org/abs/1706.03762) (Vaswani et al., 2017).
This design emphasizes the use of [self-attention](self-attention.mdx)
mechanisms instead of recurrent structures like recurrent neural networks (RNNs) or
long short-term memory networks (LSTMs), which is what
allows for the processing to be parallelized across separate compute cores
instead of requiring the model to generate predictions synchronously. This
design is currently the foundation for all major large language models (LLMs)
such as GPT, Llama, Gemini, DeepSeek, and more.
---
## Block index
In GPU programming, a block index uniquely identifies a subset of
[threads](thread.mdx) that execute a [kernel](kernel.mdx) function on the GPU.
Threads are grouped into units called [blocks](thread-block.mdx), and multiple
blocks together form a larger structure known as a [grid](grid.mdx).
Each block within the grid is assigned a unique block index, which can be
represented across one, two, or three dimensions. This allows for flexible
organization of threads to match the structure of the problem being solved.
Within each block, individual threads have their own [thread
index](thread-index.mdx), which, together with the block index, determines which
part of the problem each thread should work on. This hierarchical structure of
grids, blocks, and threads enables efficient workload distribution across the
many processing cores of the GPU, maximizing parallel performance.
Because a programmer can arrange thread blocks within a grid across one, two,
or three dimensions, a block index is a 3-element vector of x, y, and z
coordinates. For 2-dimensional arrangements, the z coordinate of all block
indices is 0, and for 1-dimensional arrangements, both the y and z coordinates
of all block indices are 0.
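In Python terms, the arithmetic that combines a block index with a thread index to locate a thread's global position looks like this (a sketch of the index math only; in a real kernel these values come from GPU built-ins rather than function arguments):

```python
def global_coords(block_idx, thread_idx, block_dim):
    """Compute a thread's global (x, y, z) position from its block
    index, thread index, and the block dimensions, using the standard
    block_index * block_dim + thread_index arithmetic per dimension."""
    return tuple(b * d + t
                 for b, t, d in zip(block_idx, thread_idx, block_dim))

# A 1-dimensional arrangement: y and z of every index are 0.
# Block 2 of 128-thread blocks, thread 5 within it -> global x = 261.
global_coords((2, 0, 0), (5, 0, 0), (128, 1, 1))  # -> (261, 0, 0)
```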
---
## Grid
A grid is the top-level organizational structure of the threads executing a
[kernel](kernel.mdx) function on a GPU. A grid consists of multiple [thread
blocks](thread-block.mdx) (also known as *workgroups* on AMD GPUs), which are
further divided into individual [threads](thread.mdx) (or *work units* on AMD
GPUs) that execute the kernel function concurrently.
The division of a grid into thread blocks serves multiple crucial purposes:
- First, it breaks down the overall workload—managed by the grid—into
smaller, more manageable portions that can be processed independently. This
division allows for better resource utilization and scheduling flexibility
across multiple [streaming multiprocessors](streaming-multiprocessor.mdx)
(SMs) in the GPU (or *compute units* on AMD GPUs).
- Second, thread blocks provide a scope for threads to collaborate through
shared memory and synchronization primitives, enabling efficient parallel
algorithms and data sharing patterns.
- Finally, thread blocks help with scalability by allowing the same program to
run efficiently across different GPU architectures, as the hardware can
automatically distribute blocks based on available resources.
The programmer specifies the number of thread blocks in a grid and how they are
arranged across one, two, or three dimensions. Typically, the programmer
determines the dimensions of the grid based on the dimensionality of the data to
process. For example, a programmer might choose a 1-dimensional grid for
processing large vectors, a 2-dimensional grid for processing matrices, and a
3-dimensional grid for processing the frames of a video. Each block within the
grid is assigned a unique [block index](block-index.mdx) that determines its
position within the grid.
Similarly, the programmer also specifies the number of threads per thread block
and how they are arranged across one, two, or three dimensions. Each thread
within a block is assigned a unique [thread index](thread-index.mdx) that
determines its position within the block. The combination of block index and
thread index uniquely identifies the position of a thread within the overall grid.
---
## Kernel
A kernel is a function that runs on a GPU, executing computations in parallel
across a large number of [threads](thread.mdx). Kernels are a fundamental
part of general-purpose GPU (GPGPU) programming and are designed to process
large datasets efficiently by performing the same operation simultaneously on
multiple data elements.
---
## GPU memory
GPU memory consists of both on-chip memory and external dynamic random-access
memory (DRAM), often referred to as *device memory* (in contrast to the *host
memory* used by the CPU).
On-chip memory includes:
- A register file for each [streaming
multiprocessor](streaming-multiprocessor.mdx) (SM), containing the
[registers](register.mdx) used by threads executing on the SM's cores
- An L1 cache for each SM to cache reads from global memory
- Shared memory for each SM, containing data explicitly shared between the
threads of a given [thread block](thread-block.mdx) executing on the SM
- A read-only constant cache for each SM, which caches data read from the
constant memory space in global memory
- An L2 cache shared by all SMs that is used to cache accesses to local or
global memory, including temporary register spills
Device memory includes:
- Global memory, which contains data accessible to all threads
- Constant memory, which contains data explicitly identified as read-only by the
programmer, and which is accessible to all threads
- Local memory, which contains data private to an individual thread, such as
statically allocated arrays, spilled registers, and other elements of the
thread's call stack
Data in global memory persists until explicitly freed, even across
[kernel](kernel.mdx) functions. This means that one kernel can write data to
global memory and then a subsequent kernel can read that data.
---
## Occupancy
In GPU programming, occupancy is a measure of the efficiency of the GPU's
compute resources. It is defined as the ratio of the number of active
[warps](warp.mdx) to the maximum number of warps that can be active on a given
[streaming multiprocessor](streaming-multiprocessor.mdx) (SM) at any one time.
Higher occupancy can improve parallel execution and hide memory latency, but
increasing occupancy does not always boost performance, as factors like memory
bandwidth and instruction dependencies may create bottlenecks. The optimal
occupancy level depends on the workload and GPU architecture.
---
## Register
A GPU register is the fastest form of storage within a [streaming
multiprocessor](streaming-multiprocessor.mdx) (SM). Registers store integer and
floating point values used frequently by a [thread](thread.mdx), reducing
reliance on slower [memory](memory.mdx) types (shared, global, or local
memory).
Registers are located within an SM in what is referred to as a *register file*.
The number of registers depends on the GPU architecture, but modern GPUs support
thousands of registers per SM.
For each thread that it executes, the SM allocates a set of registers for the
private use of that thread. The registers are associated with that thread
throughout its lifetime, even if the thread is not currently executing on the
SM's cores (for example, if it is blocked waiting for data from memory). A
thread can't access registers assigned to a different thread, preventing data
conflicts between threads. If the execution of a [kernel](kernel.mdx) function
by a thread requires more registers than available, the compiler arranges to
spill some register data to the thread's local [memory](memory.mdx). Because
local memory access is slower than register access, programmers should try to
design their kernels to avoid or limit the amount of spill.
---
## Streaming multiprocessor
The basic building block of a GPU is called a *streaming multiprocessor* (SM)
on NVIDIA GPUs or a *compute unit* (CU) on AMD GPUs (they're the same idea and
we'll refer to them both as SM). SMs sit between the high-level GPU control
logic and the individual execution units, acting as self-contained processing
factories that can operate independently and in parallel.
Multiple SMs are arranged on a single GPU chip, with each SM capable of handling
multiple workloads simultaneously. The GPU's global scheduler assigns work to
individual SMs, while the memory controller manages data flow between the SMs
and various [memory](memory.mdx) hierarchies (global memory, L2 cache, etc.).
The number of SMs in a GPU can vary significantly based on the model and
intended use case, from a handful in entry-level GPUs to dozens or even hundreds
in high-end professional cards. This scalable architecture enables GPUs to
maintain excellent performance across different workload sizes and types.
Each SM contains several essential components:
- **CUDA Cores (NVIDIA)/Stream Processors (AMD):** These are the basic
arithmetic logic units (ALUs) that perform integer and floating-point
calculations. A single SM can contain dozens or hundreds of these cores.
- **Tensor Cores (NVIDIA)/Matrix Cores (AMD):** Specialized units optimized for
matrix multiplication and convolution operations.
- **Special Function Units (SFUs):** Handle complex mathematical operations like
trigonometry, square roots, and exponential functions.
- **[Register](register.mdx) Files:** Ultra-fast storage that holds intermediate
results and thread-specific data. Modern SMs can have hundreds of kilobytes of
register space shared among active [threads](thread.mdx).
- **Shared Memory/L1 Cache:** A programmable, low-latency memory space that
enables data sharing between threads. This memory is typically configurable
between shared memory and L1 cache functions.
- **Load/Store Units:** Manage data movement between different memory spaces,
handling memory access requests from threads.
---
## Thread block
In GPU programming, a thread block (also known as *workgroup* on AMD GPUs) is a
subset of threads within a [grid](grid.mdx), which is the top-level
organizational structure of the [threads](thread.mdx) executing a
[kernel](kernel.mdx) function. As the primary building block for workload
distribution, thread blocks serve multiple crucial purposes:
- First, they break down the overall workload — managed by the grid — of a
kernel function into smaller, more manageable portions that can be processed
independently. This division allows for better resource utilization and
scheduling flexibility across multiple [streaming
multiprocessors](streaming-multiprocessor.mdx) (SMs) in the GPU.
- Second, thread blocks provide a scope for threads to collaborate through
shared memory and synchronization primitives, enabling efficient parallel
algorithms and data sharing patterns.
- Finally, thread blocks help with scalability by allowing the same program to
run efficiently across different GPU architectures, as the hardware can
automatically distribute blocks based on available resources.
The programmer specifies the number of thread blocks in a grid and how they are
arranged across one, two, or three dimensions. Each block within the grid is
assigned a unique [block index](block-index.mdx) that determines its position
within the grid. Similarly, the programmer also specifies the number of threads
per thread block and how they are arranged across one, two, or three dimensions.
Each thread within a block is assigned a unique [thread index](thread-index.mdx)
that determines its position within the block.
The GPU assigns each thread block within the grid to a streaming multiprocessor
(SM) for execution. The SM groups the threads within a block into fixed-size
subsets called [warps](warp.mdx), consisting of either 32 or 64 threads each
depending on the particular GPU architecture. The SM's warp scheduler manages
the execution of warps on the SM's cores.
Threads within a block can share data through [shared memory](memory.mdx)
and synchronize using built-in mechanisms, but they cannot directly communicate
with threads in other blocks.
---
## Thread index
In GPU programming, a thread index uniquely identifies the position of a
[thread](thread.mdx) within a particular [thread block](thread-block.mdx)
executing a [kernel](kernel.mdx) function on the GPU. A thread block is a subset
of threads in a [grid](grid.mdx), which is the top-level organizational
structure of the threads executing a kernel function. Each block within the grid
is also assigned a unique block index, which identifies the block's position
within the grid. The combination of block index and thread index uniquely
identifies the thread's overall position within the grid, and is used to
determine which part of the problem each thread should work on.
Because a programmer can arrange threads within a thread block across one, two,
or three dimensions, a thread index is a 3-element vector of x, y, and z
coordinates. For 2-dimensional arrangements, the z coordinate of all thread
indices is 0, and for 1-dimensional arrangements, both the y and z coordinates
of all thread indices are 0.
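The mapping from a 3-D block index and thread index to a unique flat position can be sketched as follows. This is one common enumeration (with x varying fastest, as in CUDA-style layouts); the dimension values are illustrative, not from any real kernel.

```python
# Sketch: mapping a 3-D block index and thread index to a flat global
# position, assuming an x-fastest layout. Dimensions are illustrative.

def linear_thread_position(block_idx, thread_idx, grid_dim, block_dim):
    """Return a unique flat index for a thread across the whole grid."""
    bx, by, bz = block_idx
    tx, ty, tz = thread_idx
    gx, gy, _ = grid_dim      # blocks per grid in each dimension
    dx, dy, dz = block_dim    # threads per block in each dimension

    block_linear = bx + by * gx + bz * gx * gy
    thread_linear = tx + ty * dx + tz * dx * dy
    return block_linear * (dx * dy * dz) + thread_linear

# A 1-dimensional arrangement: y and z coordinates are 0, as described above.
print(linear_thread_position((2, 0, 0), (5, 0, 0),
                             grid_dim=(4, 1, 1), block_dim=(32, 1, 1)))  # 69
```

In the 1-D case this reduces to the familiar `block_index * threads_per_block + thread_index`, which is how a kernel typically decides which element of the problem each thread works on.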
---
## Thread
In GPU programming, a thread (also known as a *work unit* on AMD GPUs) is the
smallest unit of execution within a [kernel](kernel.mdx) function. Threads are
grouped into [thread blocks](thread-block.mdx) (or *workgroups* on AMD GPUs),
which are further organized into a [grid](grid.mdx).
The programmer specifies the number of thread blocks in a grid and how they are
arranged across one, two, or three dimensions. Each block within the grid is
assigned a unique [block index](block-index.mdx) that determines its position
within the grid. Similarly, the programmer also specifies the number of threads
per thread block and how they are arranged across one, two, or three dimensions.
Each thread within a block is assigned a unique [thread index](thread-index.mdx)
that determines its position within the block.
The GPU assigns each thread block within the grid to a [streaming
multiprocessor](streaming-multiprocessor.mdx) (SM) for execution. The SM groups
the threads within a block into fixed-size subsets called [warps](warp.mdx),
consisting of either 32 or 64 threads each depending on the particular GPU
architecture. The SM's warp scheduler manages the execution of warps on the SM's
cores.
The SM allocates a set of [registers](register.mdx) for each thread to store
and process values private to that thread. The registers are associated with
that thread throughout its lifetime, even if the thread is not currently
executing on the SM's cores (for example, if it is blocked waiting for data from
memory). Each thread also has access to [local memory](memory.mdx) to store
statically allocated arrays, spilled registers, and other elements of the
thread's call stack.
Threads within a block can share data through shared memory and synchronize
using built-in mechanisms, but they cannot directly communicate with threads in
other blocks.
---
## Warp
In GPU programming, a warp (also known as a *wavefront* on AMD GPUs) is a subset
of [threads](thread.mdx) from a [thread block](thread-block.mdx) that execute
together in lockstep. When a GPU assigns a thread block to execute on a
[streaming multiprocessor](streaming-multiprocessor.mdx) (SM), the SM divides
the thread block into warps of 32 or 64 threads, with the exact size depending
on the GPU architecture.
If a thread block contains a number of threads not evenly divisible by the warp
size, the SM creates a partially filled final warp that still consumes the full
warp's resources. For example, if a thread block has 100 threads and the warp
size is 32, the SM creates:
- 3 full warps of 32 threads each (96 threads total)
- 1 partial warp with only 4 active threads but still occupying a full warp's
worth of resources (32 thread slots)
The SM effectively disables the unused thread slots in partial warps, but these
slots still consume hardware resources. For this reason, developers generally
should make thread block sizes a multiple of the warp size to optimize resource
usage.
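The warp arithmetic above is just ceiling division; a minimal sketch:

```python
import math

def warp_count(threads_per_block, warp_size=32):
    """Number of warps the SM creates for one thread block (ceiling division)."""
    return math.ceil(threads_per_block / warp_size)

# The example above: 100 threads with a warp size of 32.
full, remainder = divmod(100, 32)
print(warp_count(100))  # 4 warps total
print(full, remainder)  # 3 full warps; 4 active threads in the partial warp
```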
Each thread in a warp executes the same instruction at the same time on
different data, following the single instruction, multiple threads (SIMT)
execution model. If threads within a warp take different execution paths (called
*warp divergence*), the warp serially executes each branch path taken, disabling
threads that are not on that path. This means that optimal performance is
achieved when all threads in a warp follow the same execution path.
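The cost of divergence can be illustrated with a toy model: each distinct branch outcome within a warp adds one serialized pass, with off-path threads masked out. This is a simplification of real SIMT hardware, which also handles reconvergence, but it captures why uniform branches are fastest.

```python
# Toy model of SIMT branch serialization: each distinct branch path taken
# by threads in a warp costs one serialized execution pass.

def divergence_passes(branch_taken_per_thread):
    """Each distinct branch outcome in the warp adds one serialized pass."""
    return len(set(branch_taken_per_thread))

# All 32 threads agree: one pass, full throughput.
uniform = [True] * 32
# Threads split on (thread_index % 2): two serialized passes.
divergent = [i % 2 == 0 for i in range(32)]

print(divergence_passes(uniform))    # 1
print(divergence_passes(divergent))  # 2
```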
An SM can actively manage multiple warps from different thread blocks
simultaneously, helping keep execution units busy. For example, the warp
scheduler can quickly switch to another ready warp if the current warp's threads
must wait for memory access.
Warps deliver several key performance advantages:
- The hardware needs to manage only warps instead of individual threads,
reducing scheduling overhead
- Threads in a warp can access contiguous memory locations efficiently through
memory coalescing
- The hardware automatically synchronizes threads within a warp, eliminating the
need for explicit synchronization
- The warp scheduler can hide memory latency by switching between warps,
maximizing compute resource utilization
---
## Glossary
import MDXListing from '@site/src/components/Listing/MDXListing';
Explanations for some terms and concepts you'll encounter in the Modular docs.
## GPU terms
export const gpuTerms = [
'gpu/*.mdx'
]
## AI terms
export const aiTerms = [
'ai/*.mdx'
]
---
## Modular Documentation
import Homepage, { GetStartedButton } from "@site/src/components/Homepage";
import CodeNote from "@site/src/components/Homepage/CodeNote";
import { ArrowTransfer } from "@site/src/shared/Svgs/ArrowTransfer";
import { ArrowCloud } from "@site/src/shared/Svgs/ArrowCloud";
import { DesktopCode } from "@site/src/shared/Svgs/DesktopCode";
import { AIChip } from "@site/src/shared/Svgs/AIChip";
import { RecipesIcon } from "@site/src/shared/Svgs/RecipesIcon";
import { OpenBook } from "@site/src/shared/Svgs/OpenBook";
import { PuzzleIcon } from "@site/src/shared/Svgs/PuzzleIcon";
## Modular Documentation
The Modular Platform accelerates AI inference and abstracts hardware
complexity. Using our Docker container, you can deploy a GenAI model from
Hugging Face with an OpenAI-compatible endpoint on a wide range of hardware.
And if you need to customize the model or tune a GPU kernel, Modular
provides a depth of model extensibility and GPU programmability that you
won't find anywhere else.
```python title="python"
from openai import OpenAI
client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
model="google/gemma-3-27b-it",
messages=[
{"role": "user", "content": "Who won the world series in 2020?"}
],
)
print(completion.choices[0].message.content)
```
export const sectionCards = [
{
title: "Serving",
description:
"Modular’s serving library is compatible with OpenAI APIs, so you can own your endpoint with minimal client-side code changes.",
to: "/max/container",
icon: ,
},
{
title: "Deploying",
description:
      "Deploy your GenAI models to the cloud and scale your deployments across heterogeneous GPU clusters.",
to: "/mammoth/",
icon: ,
},
{
title: "Developing",
description:
"The Modular platform provides full extensibility, so you can write custom ops, hardware-agnostic GPU kernels, and more.",
to: "/max/develop/",
icon: ,
},
{
title: "Programming with Mojo🔥",
description:
"Mojo is a Python-style programming language that allows you to write code for both CPUs and GPUs. ",
to: "/mojo/manual/",
icon: ,
},
];
export const learningToolCards = [
{
title: "Agentic Cookbook",
description:
"Turn-key applications that use GenAI models with the Modular platform.",
href: "https://modul.ar/cookbook",
icon: ,
},
{
title: "GPU Puzzles",
description: "A hands-on guide to mastering GPU programming with Mojo.",
href: "https://builds.modular.com/puzzles",
icon: ,
},
{
title: "Build an LLM with MAX",
description:
"Learn to build an LLM from scratch with MAX.",
href: "https://llm.modular.com/",
icon: ,
},
];
---
## Disaggregated inference
import ContactSection from '@site/src/components/ContactSection';
Disaggregated inference is a serving architecture pattern designed for large
language models (LLMs), particularly decoder-only transformer models like those
in the LLaMA or GPT model families. In decoder-only transformers, the process
of generating model output is divided into two distinct phases: prefill and
decode.
With disaggregated inference, these phases are executed on different hardware
resources. You might see this technique referred to by several names, including
disaggregated inference, disaggregated prefill, or disaggregated serving. All
of these describe the same core idea: separating the model's inference phases
and providing each phase with dedicated resources optimized to improve
performance and scalability.
:::note
Mammoth is the technology behind advanced features like disaggregated inference,
routing, and scaling in Modular's Dedicated Endpoint and Enterprise
[editions](https://www.modular.com/pricing).
[Get in touch](https://www.modular.com/request-demo) to learn how Mammoth
enables more efficient, large-scale model serving.
:::
## When to use disaggregated inference
Disaggregated inference is particularly valuable if your priority is minimizing
latency. Since the prefill stage is compute-intensive and the decode stage is
more memory-bound, isolating the two stages and allocating them to different
GPUs or GPU nodes reduces resource contention and helps achieve both faster
time-to-first-token and smoother token streaming.
Because disaggregated inference gives you separate levers to manage the prefill
and decode phases independently, it is especially effective for improving tail
latency, such as P95: the time within which 95% of requests complete. By
optimizing tail latency, you reduce delays for the slowest requests
and can improve overall responsiveness.
Disaggregation itself doesn't directly increase throughput, but it enables more
granular parallelism strategies and resource allocation, which can increase
processing capacity. This flexibility allows you to optimize each phase
appropriately and scale prefill and decode nodes independently as needed,
improving GPU utilization and overall efficiency without over-provisioning
capacity just to handle peak workloads.
Additionally, disaggregated inference offers flexibility in heterogeneous or
resource-constrained environments. You can match each phase with hardware that
suits its specific demands.
## How disaggregated inference works
LLM inference involves two distinct phases known as prefill and decode, each
with unique performance characteristics that affect how systems should allocate
and optimize resources.
A simplified illustration of the separate prefill and
decode nodes used in a disaggregated inference serving architecture.
Prefill, also known as context encoding, is the initial phase where the model
processes the entire input prompt. During this phase, the model performs a full
forward pass to initialize its key-value (KV) cache and predict the first
output token. This cache stores the intermediate attention states necessary for
generating subsequent tokens. The prefill phase is compute-intensive,
especially in the case of long user prompts, as it involves large-scale matrix
operations that demand high floating-point throughput. The metric associated
with this phase is often referred to as Time-to-First-Token (TTFT), indicating
the duration from receiving the input prompt to producing the first output
token.
Following prefill, the model enters the decode phase, or token generation. In
this phase, the model generates output tokens one at a time, using the KV cache
initialized during prefill. By leveraging this cache, the model can quickly
access previously computed information without reprocessing the full input each
time. As a result, the decoding phase is less compute-intensive per token but
becomes memory-bound, relying heavily on efficient access to the cached data.
The key performance metric here is Inter-Token Latency (ITL), which measures
the time taken to generate each subsequent token after the first.
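The two metrics can be measured from any token stream by timestamping each token as it arrives. The sketch below uses a simulated stream in place of a real model endpoint; the sleep durations are illustrative only.

```python
import time

def measure_streaming_latency(token_stream):
    """Return (TTFT, list of inter-token latencies) for a token iterator."""
    start = time.monotonic()
    ttft = None
    itls = []
    prev = None
    for _ in token_stream:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start       # Time-to-First-Token: dominated by prefill
        else:
            itls.append(now - prev)  # Inter-Token Latency: dominated by decode
        prev = now
    return ttft, itls

# Simulated stream standing in for a real model endpoint.
def fake_stream():
    time.sleep(0.05)      # prefill: compute-bound forward pass
    yield "first"
    for _ in range(3):
        time.sleep(0.01)  # decode: memory-bound, one token at a time
        yield "next"

ttft, itls = measure_streaming_latency(fake_stream())
print(ttft >= 0.05, len(itls) == 3)  # True True
```

Disaggregation gives you separate hardware levers for each metric: prefill capacity drives TTFT, and decode capacity drives ITL.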
Disaggregated inference involves separating these two phases onto different
GPUs or GPU nodes. By doing so, each phase can be optimized independently.
Prefill workloads can be routed to hardware with high compute throughput to
handle the intensive matrix operations required to process long input prompts.
Meanwhile, decode workloads can be assigned to hardware with fast memory
access, which is better suited to the sequential, cache-dependent nature of
token generation. This separation reduces contention between compute-bound and
memory-bound tasks, improves resource utilization, and allows for more scalable
and predictable inference performance.
## Become a design partner
If you're exploring disaggregated inference for your deployments, start by
analyzing your workload to spot any imbalances between prompt processing and
token generation. Check whether your GPUs are underutilized during either
phase. If you're encountering these challenges, feel free to reach out to talk
to an AI expert.
---
## Scale your GenAI deployments
import ContactSection from '@site/src/components/ContactSection';
The [Modular Platform](/max/intro) provides dedicated endpoints and
enterprise-grade scaling for inference workloads. This scaling logic is powered
by Mammoth, a Kubernetes-native distributed AI serving tool that makes it
easier to run and manage LLMs at scale using MAX as a backend for optimal model
performance. It's designed to maximize hardware efficiency with minimal
configuration, even when running multiple models across thousands of nodes.
Figure 1. A simplified diagram of how the Modular Platform
scales your GenAI deployment.
The Mammoth control plane automatically selects the best available hardware to
meet performance targets when deploying a model and supports both manual and
automatic scaling. Mammoth's built-in orchestrator intelligently routes traffic,
taking into account hardware load, GPU memory, and caching states. You can
deploy and serve multiple models simultaneously across different hardware types
or versions without complex setup or duplication of infrastructure.
:::note
Mammoth powers advanced routing and scaling capabilities behind the scenes for
Modular's Dedicated Endpoint and Enterprise [editions](https://www.modular.com/pricing).
[Get in touch](https://www.modular.com/request-demo) to learn more about
how Mammoth can support your workloads at scale.
:::
## Access to Mammoth
If you need to serve one or more LLMs at scale with high performance and
minimal operational overhead, you can do so with Modular's Dedicated Endpoint
or Enterprise [editions](https://www.modular.com/pricing), which use Mammoth to
power routing and scaling capabilities.
Mammoth makes a difference when:
- You're running inference across heterogeneous GPU clusters (NVIDIA and AMD)
and need optimized, vendor-agnostic orchestration.
- You want a self-hosted, low-configuration deployment experience that works
out of the box, regardless of hardware or cloud provider.
- You need to dynamically scale workloads based on traffic and resource
availability, with fine-grained control over model placement and scheduling.
- You're managing fleets of models and want a unified serving layer without
duplicating infrastructure.
- You're working in a Kubernetes environment and want native integration that's
easy to operate and extend.
- You want to optimize total cost of ownership with cluster-level efficiency
features like disaggregated inference and KV cache-aware routing.
Additionally, because Mammoth is built on the MAX framework, you can use its
APIs and tools to customize and optimize every layer of the stack, from
high-level orchestration down to GPU kernels written in Mojo.
## How Mammoth works
Mammoth consists of a lightweight control plane, an intelligent
[orchestrator](/mammoth/orchestrator), and advanced optimizations such as
[disaggregated inference](/mammoth/disaggregated-inference), all working
together to efficiently deploy and run models across diverse hardware
environments.
Figure 2. An overview of the Mammoth components, including the control plane,
orchestrator, and disaggregated inference on separate prefill and decode nodes.
At the heart of Mammoth is its control plane, which takes care of setting up,
running, and scaling models automatically. Just provide the model ID (such as
`modularai/Llama-3.1-8B-Instruct`) or a path to the model on an external
storage provider like S3, and the control plane handles the rest.
You can interact with the control plane for:
- Model deployment: Launch models with a single command.
- Model management: Modify or delete deployed models.
- Multi-model orchestration: Run multiple models efficiently across shared
infrastructure.
- Scaling: Adjust replicas manually or let Mammoth autoscale intelligently.
- Resource allocation: Automatically allocate GPU resources to model deployments.
The Mammoth control plane extends the Kubernetes API with
[custom resource](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
definitions (CRDs) and controls those resources with an
[operator](https://kubernetes.io/docs/concepts/extend-kubernetes/operator/).
When you create, update, or delete a resource, the control plane provisions
infrastructure, deploys or reconfigures models, and cleans up resources as
needed.
### Deploy models
With Mammoth running behind the scenes, deploying models in Modular's Dedicated
Endpoint and Enterprise editions is designed to be simple. You choose the model
you want to serve and define your resource requirements, and Mammoth's control
plane takes care of the rest. It automatically discovers available NVIDIA or
AMD GPUs, schedules the workload across the cluster, and scales as needed.
Whether you're serving a single large model or multiple models at once, Mammoth
handles orchestration and optimization so you can focus on your application
rather than infrastructure.
### Scale deployments
The control plane adjusts the deployment to the desired number of replicas and
allocates resources accordingly. For production use, intelligent autoscaling is
built in and configurable.
### Allocate resources
You can fine-tune resource allocation for each deployment. For example, with
[disaggregated inference](/mammoth/disaggregated-inference), you can assign
separate GPU resources to nodes that handle prefill and decode stages
independently.
## Become a design partner
Mammoth is currently only available through Modular's early access program where
we're actively partnering with select organizations as design partners. Design
partners get early access to new features and share feedback to help shape the
future of Mammoth.
Talk to an AI expert to learn more about how Mammoth can support your use case
and help you scale with confidence.
---
## Routing and orchestration
import ContactSection from '@site/src/components/ContactSection';
The orchestrator is responsible for distributing incoming inference
requests to the appropriate worker node in a cluster. This orchestration layer
plays a critical role in performance, load balancing, memory optimization, and
user experience.
Rather than simply forwarding requests to the next available worker, the
orchestrator uses configurable routing strategies to intelligently direct
traffic. Each routing strategy has trade-offs, and the ideal strategy depends
on the characteristics of your workload.
## How the orchestrator works
The orchestrator routes inference requests across distributed workers.
The orchestrator receives a prompt from an HTTP server, then analyzes the
request to extract information relevant to the specific routing strategy. The
orchestrator then selects a worker based on the specified routing algorithm and
current cluster state, proxies the request to the relevant worker, and streams
the response back to the user.
:::note
Orchestration with Mammoth is still in preview and some aspects may change as
we refine the implementation. Expect ongoing improvements and potential
adjustments based on feedback and performance optimizations.
:::
An overview of steps taken by the Mammoth orchestrator.
## Routing options
You can configure the routing strategy based on your deployment goals. For
stateless requests and broad load balancing, the round robin or least request
routing options work well. If you're optimizing for cache reuse or continuity in
conversation, prefix-aware, sticky sessions, or KV cache-aware routing may
offer significant performance gains. We also provide a random routing algorithm
for benchmarking or experimental purposes.
| Name | Strategy | Use case |
|-------------------|--------------------------------------------------------------------------|----------------------------------------------------------------|
| KV cache-aware | Routes based on shared tokens or document chunks in the KV cache | Repeated prompts in chatbots, agents, or RAG-style workflows |
| Least request | Sends requests to the worker with the fewest active requests | Mixed workloads with variable size or latency requirements |
| Prefix-aware | Uses consistent hashing on prompt prefixes to group similar requests | Prompts with shared templates or recurring task descriptions |
| Random | Selects a backend worker at random | Benchmarking and exposing latency variability |
| Round robin | Distributes requests evenly across all workers in sequential order | Stateless, uniform tasks without caching needs |
| Sticky session | Routes requests with the same session ID to the same worker | Session-based chat or apps needing memory and continuity |
### KV cache-aware
KV cache-aware routing manages requests based on the contents of the KV cache
on each worker. You might use KV cache-aware routing if you're running a
retrieval-augmented generation (RAG) system where most queries share common
document chunks or similar inputs, but not identical prefixes. KV cache-aware
routing is especially useful in the following scenarios:
- For high-throughput workloads with many repeating or similar tokens across
queries.
- When you want to minimize redundant computation across diverse,
overlapping queries.
### Least request
Least request routing sends new inference requests to the worker currently
handling the fewest active requests. This helps balance load dynamically and
reduces the chance of overloading any single worker. You might use least
request routing when serving a mix of both small and large generation tasks in
order to avoid piling multiple large requests on the same node. Least request
routing is especially useful in the following situations:
- When some workers receive heavier workloads or respond slower.
- For variable-length or unpredictable inference tasks.
- When you're optimizing for low tail latency.
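The core of least request routing is a minimum over in-flight request counts. A minimal sketch, with illustrative worker names and counts (not part of the Mammoth API):

```python
# Sketch of least-request selection: pick the worker with the fewest
# in-flight requests. Worker names and counts are illustrative.

def pick_least_loaded(active_requests):
    """Choose the worker with the fewest in-flight requests."""
    return min(active_requests, key=active_requests.get)

workers = {"worker-a": 7, "worker-b": 2, "worker-c": 5}
print(pick_least_loaded(workers))  # worker-b
```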
### Prefix-aware
Prefix-aware routing, also known as consistent hashing, examines the prompt
prefix in an incoming request and routes it to the worker handling requests with
the same prefix. For example, if a support chatbot frequently receives the
prefix `{"role": "system", "content": "You are a helpful assistant."}` followed
by user-specific questions, prefix-aware routing keeps that common prefix cached
on a single node. When a worker becomes saturated with requests for a popular
prefix, the orchestrator automatically distributes the load by spilling over to
additional workers, maintaining partial cache locality while balancing traffic.
Prefix-aware request routing is especially useful in the following situations:
- When many users send queries that start with the same instructions or
template.
- If users frequently issue similar or identical prompts, like a
recurring task description or persona.
- In multi-turn conversations where session stickiness isn't enabled.
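The prefix-grouping idea can be sketched with a simple hash of the prompt's leading characters. This is a simplified stand-in for true consistent hashing (which also minimizes remapping when workers join or leave); the prefix length and worker pool are illustrative, not Mammoth defaults.

```python
import hashlib

# Sketch of prefix-aware routing: hash the first part of the prompt so
# requests sharing a prefix land on the same worker. Prefix length and
# worker names are illustrative assumptions.

def route_by_prefix(prompt, workers, prefix_len=24):
    digest = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]

workers = ["worker-a", "worker-b", "worker-c"]
system = "You are a helpful assistant. "

# Two requests sharing the system-prompt prefix route to the same worker,
# so the cached prefix stays hot on that node.
w1 = route_by_prefix(system + "How do I reset my password?", workers)
w2 = route_by_prefix(system + "Where is my invoice?", workers)
print(w1 == w2)  # True
```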
### Random
Random routing selects a backend worker at random from the pool of available
endpoints for each incoming request. Random routing is useful when you want to
eliminate routing bias and observe average worker performance under distributed
load. It can help identify variability in latency or behavior across nodes
without favoring specific ones. Random routing is especially useful for
benchmarking use cases.
### Round robin
The round robin routing algorithm distributes incoming requests
evenly across all available workers in sequential order. Once the orchestrator
reaches the last worker in the list, it cycles back to the first. You might use
round robin routing if you're running a batch of isolated tasks that don't
require any request context or caching.
Round robin routing is especially useful in the following situations:
- For stateless or homogeneous workloads where each request is independent.
- For testing environments or basic load distribution.
### Sticky session
Sticky session routing sends a user's requests to the same worker node for the
duration of their session. A session is identified by checking for a session ID
value in the request HTTP header. If this header is not present, the
orchestrator falls back to round robin routing.
You might use sticky session routing for a chatbot with user interaction, where
keeping their requests on the same worker node avoids reloading context
repeatedly. Sticky session routing is especially useful in the following
situations:
- When in-flight session state (for example, conversational memory) is
maintained on the server.
- For chatbots or streaming applications where continuity is important.
- When memory locality is key to performance.
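The sticky-session behavior described above, including the round-robin fallback for requests without a session ID, can be sketched as follows. The header name and worker names are illustrative assumptions, not the actual Mammoth configuration.

```python
import itertools

# Sketch of sticky-session routing with a round-robin fallback when no
# session ID header is present. Header name and workers are illustrative.

workers = ["worker-a", "worker-b", "worker-c"]
_round_robin = itertools.cycle(workers)
_sessions = {}

def route(headers):
    session_id = headers.get("x-session-id")
    if session_id is None:
        return next(_round_robin)      # fallback: plain round robin
    if session_id not in _sessions:
        _sessions[session_id] = next(_round_robin)
    return _sessions[session_id]       # sticky: same worker every time

a = route({"x-session-id": "alice"})
b = route({"x-session-id": "alice"})
print(a == b)  # True: the session stays on one worker
```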
## Become a design partner
To get the most out of prefix-aware routing and other advanced strategies, you
can explore [prefix caching](/max/serve/prefix-caching) and other serving layer
optimizations in MAX.
To get started with Mammoth's cluster-based deployments and optimize request
routing for your specific use case, you can reach out to us and talk to an AI
expert.
---
## Common
```c
#include "max/c/common.h"
```
**Functions:**
### `M_version()`
> const char \*M\_version()
Gets the MAX Engine version.
* **Returns:**
A string containing the semantic version of the MAX Engine.
### `M_newStatus()`
> [M\_Status](types.md#_CPPv48M_Status) \*M\_newStatus()
Creates a new status object.
This is required as an argument for several functions, such as [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175) and [`M_compileModel()`](model.md#model_8h_1a88afca26a64b945885e1e1a0d09b5750). They will update the status object and you can check for errors with [`M_isError()`](#common_8h_1adb7a61f1c8f9c5e7964e8788cd437468) and get the status message with [`M_getError()`](#common_8h_1aa294beac43a0884cef8386e69a6bfc1b). For example:
```c
M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
if (M_isError(status)) {
logError(M_getError(status));
return EXIT_FAILURE;
}
```
* **Returns:**
A pointer to the new status object. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeStatus()`](#common_8h_1ab5067fd51a5696b3679f7f629d3329c4).
### `M_getError()`
> const char \*M\_getError(const [M\_Status](types.md#_CPPv48M_Status) \*status)
Gets an error message from the `M_Status` parameter.
You should call this only if [`M_isError()`](#common_8h_1adb7a61f1c8f9c5e7964e8788cd437468) is true.
* **Parameters:**
status – The status object for reporting errors and other messages.
* **Returns:**
A pointer to a null-terminated string containing the error message.
### `M_isError()`
> int M\_isError(const [M\_Status](types.md#_CPPv48M_Status) \*status)
Checks if status holds an error value.
* **Parameters:**
status – The status object for reporting errors and other messages.
* **Returns:**
`0` if there is no error, `1` otherwise.
### `M_freeStatus()`
> void M\_freeStatus([M\_Status](types.md#_CPPv48M_Status) \*status)
Deallocates the memory for the status object. No-op if `status` is `NULL`.
* **Parameters:**
status – The status object for reporting errors and other messages.
---
## Context
```c
#include "max/c/context.h"
```
**Functions:**
### `M_newRuntimeConfig()`
> [M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*M\_newRuntimeConfig()
Creates a new runtime config.
This configures runtime details such as the number of threads and log level.
By default, the config object’s number of threads will be set to `0`, which is internally used to refer to the number of physical processors in the first socket in the system. You can change this with `M_setNumThreads()`.
You need this as an argument for [`M_newRuntimeContext()`](#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175).
* **Returns:**
A pointer to the new runtime config. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeRuntimeConfig()`](#context_8h_1a47f7e22f7f71da9ab5fb3a1886911610).
### `M_freeRuntimeConfig()`
> void M\_freeRuntimeConfig([M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config)
Deallocates the memory for a runtime config. No-op if `config` is `NULL`.
* **Parameters:**
config – The runtime config.
### `M_runtimeConfigAddDevice()`
> void M\_runtimeConfigAddDevice([M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config, [M\_Device](types.md#_CPPv48M_Device) \*device)
Adds a device to be accessible from the runtime.
* **Parameters:**
* config – The runtime config.
* device – The device to add to the runtime config.
### `M_newRuntimeContext()`
> [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*M\_newRuntimeContext(const [M\_RuntimeConfig](types.md#_CPPv415M_RuntimeConfig) \*config, [M\_Status](types.md#_CPPv48M_Status) \*status)
Creates a runtime context.
The context is an application-level object that sets up various resources such as a thread pool and allocators during inference. You need this before you can call [`M_compileModel()`](model.md#model_8h_1a88afca26a64b945885e1e1a0d09b5750).
It’s expected that there’s only one runtime context active in an inference session at a time. We recommend you create one context and use it throughout your application.
For example:
```c
M_Status *status = M_newStatus();
M_RuntimeConfig *runtimeConfig = M_newRuntimeConfig();
M_RuntimeContext *context = M_newRuntimeContext(runtimeConfig, status);
if (M_isError(status)) {
logError(M_getError(status));
return EXIT_FAILURE;
}
```
* **Parameters:**
* config – The runtime config, from [`M_newRuntimeConfig()`](#context_8h_1a963f1d4eefd812ba8691acf516007cfc).
* status – The status object for reporting errors. It is filled with an error message if construction of the runtime context fails.
* **Returns:**
A pointer to the runtime context object. On success, this is a valid pointer. On failure, this is a `NULL` pointer with an error message in the status. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeRuntimeContext()`](#context_8h_1a2434a11d8d65890c66f6b5516243a730).
### `M_freeRuntimeContext()`
> void M\_freeRuntimeContext([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context)
Deallocates the memory for a runtime context. No-op if `context` is `NULL`.
* **Parameters:**
context – The runtime context.
### `M_setDebugPrintOptions()`
> void M\_setDebugPrintOptions([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, [M\_ResultOutputStyle](types.md#_CPPv419M_ResultOutputStyle) style, unsigned int precision, const char \*directory)
Sets the options for debug printing of tensors when executing a model.
* **Parameters:**
* context – The runtime context.
* style – The way the data will be printed.
* precision – The floating point print out precision.
* directory – The directory to store binary output.
### `M_setMojoDefineBool()`
> void M\_setMojoDefineBool([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, const char \*key, bool value)
Sets a Mojo compile-time define with a boolean value.
* **Parameters:**
* context – The runtime context.
* key – The name of the define.
* value – The boolean to set the define to.
### `M_setMojoDefineInt()`
> void M\_setMojoDefineInt([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, const char \*key, int value)
Sets a Mojo compile-time define with an integer value.
* **Parameters:**
* context – The runtime context.
* key – The name of the define.
* value – The integer to set the define to.
### `M_setMojoDefineString()`
> void M\_setMojoDefineString([M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, const char \*key, const char \*value)
Sets a Mojo compile-time define with a string value.
* **Parameters:**
* context – The runtime context.
* key – The name of the define.
* value – The string to set the define to.
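As a sketch, the three define setters above might be used together on a runtime context like this. The define keys (`"use_fast_path"`, `"num_warmups"`, `"log_level"`) are hypothetical names for illustration only, not documented keys:

```c
#include <stdbool.h>
#include "max/c/context.h"

// Sketch only: the key names below are hypothetical.
void configureDefines(M_RuntimeContext *context) {
  M_setMojoDefineBool(context, "use_fast_path", true);
  M_setMojoDefineInt(context, "num_warmups", 3);
  M_setMojoDefineString(context, "log_level", "debug");
}
```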
---
## C API
You can use the following C APIs to run inference with MAX Engine.
## API headers
Each of the following pages represents one of the C API header files:
* [Common](common.md)
* [`M_version()`](common.md#_CPPv49M_versionv)
* [`M_newStatus()`](common.md#_CPPv411M_newStatusv)
* [`M_getError()`](common.md#_CPPv410M_getErrorPK8M_Status)
* [`M_isError()`](common.md#_CPPv49M_isErrorPK8M_Status)
* [`M_freeStatus()`](common.md#_CPPv412M_freeStatusP8M_Status)
* [Context](context.md)
* [`M_newRuntimeConfig()`](context.md#_CPPv418M_newRuntimeConfigv)
* [`M_freeRuntimeConfig()`](context.md#_CPPv419M_freeRuntimeConfigP15M_RuntimeConfig)
* [`M_runtimeConfigAddDevice()`](context.md#_CPPv424M_runtimeConfigAddDeviceP15M_RuntimeConfigP8M_Device)
* [`M_newRuntimeContext()`](context.md#_CPPv419M_newRuntimeContextPK15M_RuntimeConfigP8M_Status)
* [`M_freeRuntimeContext()`](context.md#_CPPv420M_freeRuntimeContextP16M_RuntimeContext)
* [`M_setDebugPrintOptions()`](context.md#_CPPv422M_setDebugPrintOptionsP16M_RuntimeContext19M_ResultOutputStylejPKc)
* [`M_setMojoDefineBool()`](context.md#_CPPv419M_setMojoDefineBoolP16M_RuntimeContextPKcb)
* [`M_setMojoDefineInt()`](context.md#_CPPv418M_setMojoDefineIntP16M_RuntimeContextPKci)
* [`M_setMojoDefineString()`](context.md#_CPPv421M_setMojoDefineStringP16M_RuntimeContextPKcPKc)
* [Model](model.md)
* [`M_newCompileConfig()`](model.md#_CPPv418M_newCompileConfigv)
* [`M_setModelPath()`](model.md#_CPPv414M_setModelPathP15M_CompileConfigPKc)
* [`M_compileModel()`](model.md#_CPPv414M_compileModelPK16M_RuntimeContextPP15M_CompileConfigP8M_Status)
* [`M_waitForCompilation()`](model.md#_CPPv420M_waitForCompilationP20M_AsyncCompiledModelP8M_Status)
* [`M_compileModelSync()`](model.md#_CPPv418M_compileModelSyncPK16M_RuntimeContextPP15M_CompileConfigP8M_Status)
* [`M_initModel()`](model.md#_CPPv411M_initModelPK16M_RuntimeContextPK20M_AsyncCompiledModelPK17M_WeightsRegistryP8M_Status)
* [`M_waitForModel()`](model.md#_CPPv414M_waitForModelP12M_AsyncModelP8M_Status)
* [`M_executeModelSync()`](model.md#_CPPv418M_executeModelSyncPK16M_RuntimeContextP12M_AsyncModelP16M_AsyncTensorMapP8M_Status)
* [`M_freeModel()`](model.md#_CPPv411M_freeModelP12M_AsyncModel)
* [`M_freeCompiledModel()`](model.md#_CPPv419M_freeCompiledModelP20M_AsyncCompiledModel)
* [`M_freeCompileConfig()`](model.md#_CPPv419M_freeCompileConfigP15M_CompileConfig)
* [Tensor](tensor.md)
* [`M_newTensorSpec()`](tensor.md#_CPPv415M_newTensorSpecPK7int64_t7int64_t7M_DtypePKcPK8M_Device)
* [`M_isDynamicRanked()`](tensor.md#_CPPv417M_isDynamicRankedPK12M_TensorSpec)
* [`M_getDimAt()`](tensor.md#_CPPv410M_getDimAtPK12M_TensorSpec6size_t)
* [`M_getRank()`](tensor.md#_CPPv49M_getRankPK12M_TensorSpec)
* [`M_getDtype()`](tensor.md#_CPPv410M_getDtypePK12M_TensorSpec)
* [`M_getName()`](tensor.md#_CPPv49M_getNameP12M_TensorSpec)
* [`M_newAsyncTensorMap()`](tensor.md#_CPPv419M_newAsyncTensorMapPK16M_RuntimeContext)
* [`M_borrowTensorInto()`](tensor.md#_CPPv418M_borrowTensorIntoP16M_AsyncTensorMapPvPK12M_TensorSpecP8M_Status)
* [`M_getTensorByNameFrom()`](tensor.md#_CPPv421M_getTensorByNameFromP16M_AsyncTensorMapPKcP8M_Status)
* [`M_getTensorNumElements()`](tensor.md#_CPPv422M_getTensorNumElementsPK13M_AsyncTensor)
* [`M_getTensorType()`](tensor.md#_CPPv415M_getTensorTypePK13M_AsyncTensor)
* [`M_getTensorData()`](tensor.md#_CPPv415M_getTensorDataPK13M_AsyncTensor)
* [`M_getTensorSpec()`](tensor.md#_CPPv415M_getTensorSpecPK13M_AsyncTensor)
* [`M_getDeviceTypeFromSpec()`](tensor.md#_CPPv423M_getDeviceTypeFromSpecPK12M_TensorSpec)
* [`M_getDeviceIdFromSpec()`](tensor.md#_CPPv421M_getDeviceIdFromSpecPK12M_TensorSpec)
* [`M_getTensorDevice()`](tensor.md#_CPPv417M_getTensorDevicePK13M_AsyncTensor)
* [`M_copyTensorToDevice()`](tensor.md#_CPPv420M_copyTensorToDeviceP13M_AsyncTensorP8M_DeviceP8M_Status)
* [`M_freeTensor()`](tensor.md#_CPPv412M_freeTensorP13M_AsyncTensor)
* [`M_freeTensorNameArray()`](tensor.md#_CPPv421M_freeTensorNameArrayP17M_TensorNameArray)
* [`M_freeTensorSpec()`](tensor.md#_CPPv416M_freeTensorSpecP12M_TensorSpec)
* [`M_freeAsyncTensorMap()`](tensor.md#_CPPv420M_freeAsyncTensorMapP16M_AsyncTensorMap)
* [Types](types.md)
* [`M_Status`](types.md#_CPPv48M_Status)
* [`M_RuntimeConfig`](types.md#_CPPv415M_RuntimeConfig)
* [`M_RuntimeContext`](types.md#_CPPv416M_RuntimeContext)
* [`M_CompileConfig`](types.md#_CPPv415M_CompileConfig)
* [`M_AsyncCompiledModel`](types.md#_CPPv420M_AsyncCompiledModel)
* [`M_AsyncModel`](types.md#_CPPv412M_AsyncModel)
* [`M_AsyncTensor`](types.md#_CPPv413M_AsyncTensor)
* [`M_TensorNameArray`](types.md#_CPPv417M_TensorNameArray)
* [`M_TensorSpec`](types.md#_CPPv412M_TensorSpec)
* [`M_AsyncTensorMap`](types.md#_CPPv416M_AsyncTensorMap)
* [`M_WeightsRegistry`](types.md#_CPPv417M_WeightsRegistry)
* [`M_Device`](types.md#_CPPv48M_Device)
* [`M_Dtype`](types.md#_CPPv47M_Dtype)
* [`M_UNKNOWN`](types.md#_CPPv4N7M_Dtype9M_UNKNOWNE)
* [`mIsInteger`](types.md#_CPPv4N7M_Dtype10mIsIntegerE)
* [`mIsFloat`](types.md#_CPPv4N7M_Dtype8mIsFloatE)
* [`mIsComplex`](types.md#_CPPv4N7M_Dtype10mIsComplexE)
* [`mIsSigned`](types.md#_CPPv4N7M_Dtype9mIsSignedE)
* [`kIntWidthShift`](types.md#_CPPv4N7M_Dtype14kIntWidthShiftE)
* [`M_INT1`](types.md#_CPPv4N7M_Dtype6M_INT1E)
* [`M_UINT1`](types.md#_CPPv4N7M_Dtype7M_UINT1E)
* [`M_INT2`](types.md#_CPPv4N7M_Dtype6M_INT2E)
* [`M_UINT2`](types.md#_CPPv4N7M_Dtype7M_UINT2E)
* [`M_INT4`](types.md#_CPPv4N7M_Dtype6M_INT4E)
* [`M_UINT4`](types.md#_CPPv4N7M_Dtype7M_UINT4E)
* [`M_INT8`](types.md#_CPPv4N7M_Dtype6M_INT8E)
* [`M_UINT8`](types.md#_CPPv4N7M_Dtype7M_UINT8E)
* [`M_INT16`](types.md#_CPPv4N7M_Dtype7M_INT16E)
* [`M_UINT16`](types.md#_CPPv4N7M_Dtype8M_UINT16E)
* [`M_INT32`](types.md#_CPPv4N7M_Dtype7M_INT32E)
* [`M_UINT32`](types.md#_CPPv4N7M_Dtype8M_UINT32E)
* [`M_INT64`](types.md#_CPPv4N7M_Dtype7M_INT64E)
* [`M_UINT64`](types.md#_CPPv4N7M_Dtype8M_UINT64E)
* [`M_INT128`](types.md#_CPPv4N7M_Dtype8M_INT128E)
* [`M_UINT128`](types.md#_CPPv4N7M_Dtype9M_UINT128E)
* [`M_FLOAT4_E2M1FN`](types.md#_CPPv4N7M_Dtype15M_FLOAT4_E2M1FNE)
* [`M_FLOAT8_E8M0FNU`](types.md#_CPPv4N7M_Dtype16M_FLOAT8_E8M0FNUE)
* [`M_FLOAT8_E3M4`](types.md#_CPPv4N7M_Dtype13M_FLOAT8_E3M4E)
* [`M_FLOAT8_E4M3FN`](types.md#_CPPv4N7M_Dtype15M_FLOAT8_E4M3FNE)
* [`M_FLOAT8_E4M3FNUZ`](types.md#_CPPv4N7M_Dtype17M_FLOAT8_E4M3FNUZE)
* [`M_FLOAT8_E5M2`](types.md#_CPPv4N7M_Dtype13M_FLOAT8_E5M2E)
* [`M_FLOAT8_E5M2FNUZ`](types.md#_CPPv4N7M_Dtype17M_FLOAT8_E5M2FNUZE)
* [`M_FLOAT16`](types.md#_CPPv4N7M_Dtype9M_FLOAT16E)
* [`M_BFLOAT16`](types.md#_CPPv4N7M_Dtype10M_BFLOAT16E)
* [`M_FLOAT32`](types.md#_CPPv4N7M_Dtype9M_FLOAT32E)
* [`M_FLOAT64`](types.md#_CPPv4N7M_Dtype9M_FLOAT64E)
* [`M_BOOL`](types.md#_CPPv4N7M_Dtype6M_BOOLE)
* [`M_AllocatorType`](types.md#_CPPv415M_AllocatorType)
* [`kSystem`](types.md#_CPPv4N15M_AllocatorType7kSystemE)
* [`kCaching`](types.md#_CPPv4N15M_AllocatorType8kCachingE)
* [`M_ValueType`](types.md#_CPPv411M_ValueType)
* [`M_STRING_VALUE`](types.md#_CPPv4N11M_ValueType14M_STRING_VALUEE)
* [`M_DOUBLE_VALUE`](types.md#_CPPv4N11M_ValueType14M_DOUBLE_VALUEE)
* [`M_LONG_VALUE`](types.md#_CPPv4N11M_ValueType12M_LONG_VALUEE)
* [`M_BOOL_VALUE`](types.md#_CPPv4N11M_ValueType12M_BOOL_VALUEE)
* [`M_TENSOR_VALUE`](types.md#_CPPv4N11M_ValueType14M_TENSOR_VALUEE)
* [`M_LIST_VALUE`](types.md#_CPPv4N11M_ValueType12M_LIST_VALUEE)
* [`M_TUPLE_VALUE`](types.md#_CPPv4N11M_ValueType13M_TUPLE_VALUEE)
* [`M_DICT_VALUE`](types.md#_CPPv4N11M_ValueType12M_DICT_VALUEE)
* [`M_NONE_VALUE`](types.md#_CPPv4N11M_ValueType12M_NONE_VALUEE)
* [`M_UNKNOWN_VALUE`](types.md#_CPPv4N11M_ValueType15M_UNKNOWN_VALUEE)
* [`M_MOJO_VALUE`](types.md#_CPPv4N11M_ValueType12M_MOJO_VALUEE)
* [`M_PYTHON_MOJO_VALUE`](types.md#_CPPv4N11M_ValueType19M_PYTHON_MOJO_VALUEE)
* [`M_DeviceType`](types.md#_CPPv412M_DeviceType)
* [`M_HOST`](types.md#_CPPv4N12M_DeviceType6M_HOSTE)
* [`M_ACCELERATOR`](types.md#_CPPv4N12M_DeviceType13M_ACCELERATORE)
* [`M_ResultOutputStyle`](types.md#_CPPv419M_ResultOutputStyle)
* [`M_COMPACT`](types.md#_CPPv4N19M_ResultOutputStyle9M_COMPACTE)
* [`M_FULL`](types.md#_CPPv4N19M_ResultOutputStyle6M_FULLE)
* [`M_BINARY`](types.md#_CPPv4N19M_ResultOutputStyle8M_BINARYE)
* [`M_BINARY_MAX_CHECKPOINT`](types.md#_CPPv4N19M_ResultOutputStyle23M_BINARY_MAX_CHECKPOINTE)
* [`M_NONE`](types.md#_CPPv4N19M_ResultOutputStyle6M_NONEE)
## Async API usage
Our C API allows you to compile and execute models asynchronously. In general,
using asynchronous APIs effectively can be difficult, but rewarding for
performance. To help, this section explains some important concepts and mental
models to keep in mind when using the API.
Our APIs are asynchronous unless stated otherwise, typically indicated by
`Sync` in the function name. For example, we provide both `M_executeModel` and
[`M_executeModelSync()`](model.md#_CPPv418M_executeModelSyncPK16M_RuntimeContextP12M_AsyncModelP16M_AsyncTensorMapP8M_Status).
### Types
Our API describes the underlying async-holding types with a “value or error”
concept. Conceptually, this means that the type is in one of three states:
* `Constructed` - the value is not yet there, but there is no error
* `Available` - the value is there and ready for use
* `Error` - the value is not there and there is an error
### Synchronization points
When using the async APIs, be mindful of the synchronization-point APIs listed
below. They are what let you discern between the `Constructed` and `Available`
states described above: after a synchronization point returns, the input is
never in the `Constructed` state; it always resolves to either `Available` or
`Error`.
* [`M_waitForCompilation()`](model.md#_CPPv420M_waitForCompilationP20M_AsyncCompiledModelP8M_Status)
* [`M_waitForModel()`](model.md#_CPPv414M_waitForModelP12M_AsyncModelP8M_Status)
* `M_waitForTensors`
### Errors
Errors surface immediately when using our synchronous APIs. Otherwise, in the
case of async APIs, errors will not surface until the next synchronization
point. You can query the error message by calling [`M_getError()`](common.md#_CPPv410M_getErrorPK8M_Status).
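Putting this together, an async compilation might check the status twice: once after the call itself (which catches invalid configuration) and once after the synchronization point (which surfaces compilation errors). This sketch assumes `context`, `modelPath`, `status`, and a `logError()` helper already exist, as in the examples elsewhere on this page:

```c
// Sketch only: errors from the async M_compileModel() call may not
// surface until the next synchronization point.
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, modelPath);
M_AsyncCompiledModel *compiledModel =
    M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {           // invalid config surfaces immediately
  logError(M_getError(status));
  return EXIT_FAILURE;
}
M_waitForCompilation(compiledModel, status);  // synchronization point
if (M_isError(status)) {           // compilation errors surface here
  logError(M_getError(status));
  return EXIT_FAILURE;
}
```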
---
## Model
```c
#include "max/c/model.h"
```
**Functions:**
### `M_newCompileConfig()`
> [M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*M\_newCompileConfig()
Creates an object you can use to configure model compilation.
You need `M_CompileConfig` as an argument for several functions, including [`M_setModelPath()`](#model_8h_1a03244f05c8a6092a55d3abc124ad90b7) and [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750).
* **Returns:**
A pointer to a new compilation configuration. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeCompileConfig()`](#model_8h_1abbf74b13adaf5bc8a0bb4d46c40688d9). This compilation configuration can only be used for a single compilation call. Any subsequent compilations must be passed a new `M_CompileConfig` (created by calling [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665) again).
### `M_setModelPath()`
> void M\_setModelPath([M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*compileConfig, const char \*path)
Sets the path to a model.
You must call this before you call [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750). Otherwise, [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750) returns an error in `status`.
* **Parameters:**
* compileConfig – The compilation configuration for your model, from [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665).
* path – The path to your model. The model does not need to exist on the filesystem at this point. This follows the same semantics and expectations as `std::filesystem::path`.
### `M_compileModel()`
> [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*M\_compileModel(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, [M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*\*compileConfig, [M\_Status](types.md#_CPPv48M_Status) \*status)
Compiles a model.
This immediately returns an `M_AsyncCompiledModel`, with compilation happening asynchronously. If you need to block to await compilation, you can then call [`M_waitForCompilation()`](#model_8h_1a8040a6488596f863c205d769d92ad013).
You must call [`M_setModelPath()`](#model_8h_1a03244f05c8a6092a55d3abc124ad90b7) before you call this. For example:
```c
M_CompileConfig *compileConfig = M_newCompileConfig();
M_setModelPath(compileConfig, modelPath);
M_AsyncCompiledModel *compiledModel =
M_compileModel(context, &compileConfig, status);
if (M_isError(status)) {
logError(M_getError(status));
return EXIT_FAILURE;
}
```
The `M_AsyncCompiledModel` returned here is not ready for inference yet. You need to then initialize the model with [`M_initModel()`](#model_8h_1a2dcb9570ae117602579182d8faed494a).
* **Parameters:**
* context – The runtime context, from [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175).
* compileConfig – The address of the compilation configuration for your model, created with [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665) and with the model set via [`M_setModelPath()`](#model_8h_1a03244f05c8a6092a55d3abc124ad90b7). Ownership of the configuration is handed over to the API.
* status – The status used to report errors in the case of failures during model compilation.
* **Returns:**
A pointer to an `M_AsyncCompiledModel`. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeCompiledModel()`](#model_8h_1a5b6846eb4d47d445eb65c305b1c81b1c). If the config is invalid, it returns a `NULL` pointer. If the model compilation fails, the pointer is `NULL` and the `status` parameter contains an error message. `compileConfig` will be reset to `NULL` after this call irrespective of status and cannot be reused, and any subsequent calls must take a new `M_CompileConfig`.
### `M_waitForCompilation()`
> void M\_waitForCompilation([M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*compiledModel, [M\_Status](types.md#_CPPv48M_Status) \*status)
Blocks execution until the model is compiled.
This waits for the async compiled model to be complete after calling [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750). When this function returns, the model is resolved to either a compiled model or an error.
* **Parameters:**
* compiledModel – The model received from [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750).
* status – The status used to report errors in the case of failures.
### `M_compileModelSync()`
> [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*M\_compileModelSync(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, [M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*\*compileConfig, [M\_Status](types.md#_CPPv48M_Status) \*status)
Synchronously compiles a model.
Unlike [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750), this blocks until model compilation is complete. It returns an `M_AsyncCompiledModel` without needing to call [`M_waitForCompilation()`](#model_8h_1a8040a6488596f863c205d769d92ad013). All other setup and usage is identical to [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750).
* **Parameters:**
* context – The runtime context, from [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175).
* compileConfig – The address of the compilation configuration for your model, created with [`M_newCompileConfig()`](#model_8h_1a417e7a581c096ca26c36a1875163b665) and with the model set via [`M_setModelPath()`](#model_8h_1a03244f05c8a6092a55d3abc124ad90b7). Ownership of the configuration is handed over to the API.
* status – The status used to report errors in the case of failures during model compilation.
* **Returns:**
A pointer to an `M_AsyncCompiledModel`. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeCompiledModel()`](#model_8h_1a5b6846eb4d47d445eb65c305b1c81b1c). If the config is invalid, it returns a `NULL` pointer. If the model compilation fails, the pointer is `NULL` and the `status` parameter contains an error message. `compileConfig` will be reset to `NULL` after this call irrespective of status and cannot be reused, and any subsequent calls must take a new `M_CompileConfig`.
### `M_initModel()`
> [M\_AsyncModel](types.md#_CPPv412M_AsyncModel) \*M\_initModel(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, const [M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*compiledModel, const [M\_WeightsRegistry](types.md#_CPPv417M_WeightsRegistry) \*weightsRegistry, [M\_Status](types.md#_CPPv48M_Status) \*status)
Sets up a model for execution.
You can call this immediately after [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750)—you don’t need to wait for the async compilation.
This function also returns immediately with model initialization happening asynchronously. For example:
```c
M_AsyncModel *model = M_initModel(
context, compiledModel, weightsRegistry, status);
if (M_isError(status)) {
logError(M_getError(status));
return EXIT_FAILURE;
}
```
If you want to block until `M_AsyncModel` is initialized, you can call [`M_waitForModel()`](#model_8h_1a852bec3f80cebb5c06911091d5cab349), but that’s not necessary and you can immediately call [`M_executeModelSync()`](#model_8h_1a2ced4683834a77d0b943a6bc72d846d5).
* **Parameters:**
* context – The runtime context, from [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175).
* compiledModel – The compiled model, from [`M_compileModel()`](#model_8h_1a88afca26a64b945885e1e1a0d09b5750).
* weightsRegistry – A mapping from weights’ names to their data. The weights registry is used to update weights or otherwise pass weights to the model init block at runtime, without recompiling the model graph. If the model doesn’t use the weights registry, it is safe to pass `NULL`.
* status – The status used to report errors in the case of failures. The status contains an error only if the given context or compiled model is invalid. Other errors will not surface until the next synchronization point.
* **Returns:**
A pointer to an `M_AsyncModel` that holds an async value to a compiled model. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeModel()`](#model_8h_1a4094fa8e414f8b6a6563474f8840d33c). If model initialization fails, the `status` parameter contains an error message.
### `M_waitForModel()`
> void M\_waitForModel([M\_AsyncModel](types.md#_CPPv412M_AsyncModel) \*model, [M\_Status](types.md#_CPPv48M_Status) \*status)
Blocks execution until the model is initialized.
This waits for the model setup to finish in [`M_initModel()`](#model_8h_1a2dcb9570ae117602579182d8faed494a).
* **Parameters:**
* model – The model.
* status – The status used to report errors in the case of failures.
### `M_executeModelSync()`
> [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*M\_executeModelSync(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context, [M\_AsyncModel](types.md#_CPPv412M_AsyncModel) \*initializedModel, [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*inputs, [M\_Status](types.md#_CPPv48M_Status) \*status)
Executes a model synchronously.
The inputs and outputs are `M_AsyncTensorMap` objects to allow chaining of inference. This operation is blocking and waits until the output results are ready.
* **Parameters:**
* context – The runtime context.
* initializedModel – The model to execute, from [`M_initModel()`](#model_8h_1a2dcb9570ae117602579182d8faed494a). Although that function is async, you can pass the `M_AsyncModel` here immediately.
* inputs – The tensor inputs.
* status – The status used to report errors in the case of failures. This includes failures encountered while running the model; there is no need for an explicit synchronization point.
* **Returns:**
A pointer to an `M_AsyncTensorMap` that holds the output tensors. These tensors are in a resolved state. You are responsible for the memory associated with the pointer returned. You can deallocate the memory by calling [`M_freeAsyncTensorMap()`](tensor.md#tensor_8h_1a0ac9628dcba39c9977b7f7ff95d8781e). In the case that executing the model fails, the `status` parameter contains an error message.
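As a sketch, executing a model and reading back an output might look like the following. The output tensor name `"output0"` is hypothetical; actual names depend on your model. This assumes `context`, `model`, `inputs`, `status`, and a `logError()` helper already exist, as in the examples elsewhere on this page:

```c
// Sketch only: "output0" is a hypothetical output tensor name.
M_AsyncTensorMap *outputs =
    M_executeModelSync(context, model, inputs, status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
M_AsyncTensor *result =
    M_getTensorByNameFrom(outputs, "output0", status);
if (M_isError(status)) {
  logError(M_getError(status));
  return EXIT_FAILURE;
}
const void *data = M_getTensorData(result);
size_t numElements = M_getTensorNumElements(result);
// ... use the data, then release everything you own:
M_freeTensor(result);
M_freeAsyncTensorMap(outputs);
```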
### `M_freeModel()`
> void M\_freeModel([M\_AsyncModel](types.md#_CPPv412M_AsyncModel) \*model)
Deallocates the memory for the model. No-op if `model` is `NULL`.
* **Parameters:**
model – The model to deallocate.
### `M_freeCompiledModel()`
> void M\_freeCompiledModel([M\_AsyncCompiledModel](types.md#_CPPv420M_AsyncCompiledModel) \*model)
Deallocates the memory for the compiled model. No-op if `model` is `NULL`.
* **Parameters:**
model – The compiled model to deallocate.
### `M_freeCompileConfig()`
> void M\_freeCompileConfig([M\_CompileConfig](types.md#_CPPv415M_CompileConfig) \*config)
Deallocates the memory for the compile config. No-op if `config` is `NULL`.
* **Parameters:**
config – The compilation configuration to deallocate.
---
## Tensor
```c
#include "max/c/tensor.h"
```
**Functions:**
### `M_newTensorSpec()`
> [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*M\_newTensorSpec(const int64\_t \*shape, int64\_t rankSize, [M\_Dtype](types.md#_CPPv47M_Dtype) dtype, const char \*tensorName, const [M\_Device](types.md#_CPPv48M_Device) \*device)
Creates a tensor specification.
You need this in order to set the input tensors with [`M_borrowTensorInto()`](#tensor_8h_1a58a1646cfa1726b047b020c89eb7345c).
When storing tensor data in memory, we always use a diminishing stride size. That is, earlier dimensions in the shape have larger strides than later dimensions. For example, a C array declared as `int arr[1][2][3]` would have a shape specified as `{1, 2, 3}`.
* **Parameters:**
* shape – The shape of the tensor.
* rankSize – The rank size of the tensor.
* dtype – The datatype for the tensor.
* tensorName – The name for the tensor. This string gets copied as part of the operation of `M_newTensorSpec`, so your original string need not remain valid after the completion of this call.
* device – The device on which the tensor resides.
* **Returns:**
A pointer to the tensor spec. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeTensorSpec()`](#tensor_8h_1af0b957daeba1760134c3f24079b53026).
### `M_isDynamicRanked()`
> int M\_isDynamicRanked(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)
Returns whether the given spec has a dynamic rank.
* **Parameters:**
spec – The tensor spec.
* **Returns:**
`1` if the rank is dynamic. `0` otherwise.
### `M_getDimAt()`
> int64\_t M\_getDimAt(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec, size\_t axis)
Gets the element at a particular axis.
* **Parameters:**
* spec – The tensor spec.
* axis – The requested axis.
* **Returns:**
The dimension at the requested axis, if the spec and axis are valid and the rank is static. Otherwise, `0`. A dimension equal to `kDynamicDimensionValue` indicates a dynamic dimension (for example, the batch size of a model that expects a batched tensor).
### `M_getRank()`
> int64\_t M\_getRank(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)
Gets the rank from the tensor spec.
* **Parameters:**
spec – The tensor spec.
* **Returns:**
The number of dimensions in the tensor spec, if the spec is valid and has a static rank; `kDynamicRankValue` if the rank is dynamic. Otherwise, `0`.
### `M_getDtype()`
> [M\_Dtype](types.md#_CPPv47M_Dtype) M\_getDtype(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)
Gets the datatype from the tensor spec.
* **Parameters:**
spec – The tensor spec.
* **Returns:**
The element type from the tensor spec if the tensor spec is valid. Otherwise, `M_UNKNOWN`.
### `M_getName()`
> const char \*M\_getName([M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)
Gets the name of the tensor from the tensor spec.
* **Parameters:**
spec – The tensor spec.
* **Returns:**
A null-terminated string containing the name of the tensor if the `spec` is valid. Otherwise, `NULL`. The memory associated with the returned string is owned by `spec`.
### `M_newAsyncTensorMap()`
> [M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*M\_newAsyncTensorMap(const [M\_RuntimeContext](types.md#_CPPv416M_RuntimeContext) \*context)
Creates a map of tensor names to async tensors.
* **Parameters:**
context – The runtime context.
* **Returns:**
A pointer to the tensor map. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeAsyncTensorMap()`](#tensor_8h_1a0ac9628dcba39c9977b7f7ff95d8781e).
### `M_borrowTensorInto()`
> void M\_borrowTensorInto([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensors, void \*input, const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*tensorSpec, [M\_Status](types.md#_CPPv48M_Status) \*status)
Adds a tensor to the tensor map.
You are responsible for the lifetime of the input tensor data. Its data gets “borrowed” into the tensor map.
* **Parameters:**
* tensors – The tensor map, from [`M_newAsyncTensorMap()`](#tensor_8h_1a18039c6e6c1769b947120b27178306eb).
* input – The input tensor data.
* tensorSpec – The tensor spec, from [`M_newTensorSpec()`](#tensor_8h_1ab7546d4d0a22ae82134d200272e8f8f4). This gets copied as part of the operation of `M_borrowTensorInto`, so your original tensorSpec need not exist through the lifetime of the tensor map.
* status – The status object for reporting errors.
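As a sketch, building an input tensor map might look like the following. The shape, dtype, and tensor name `"input0"` are hypothetical; use your model's actual input signature. This assumes `context`, `device`, `inputData`, and `status` already exist:

```c
// Sketch only: the shape, dtype, and name "input0" are hypothetical.
int64_t shape[] = {1, 3, 224, 224};
M_TensorSpec *spec =
    M_newTensorSpec(shape, 4, M_FLOAT32, "input0", device);
M_AsyncTensorMap *inputs = M_newAsyncTensorMap(context);
M_borrowTensorInto(inputs, inputData, spec, status);
M_freeTensorSpec(spec);  // the spec is copied, so it's safe to free now
```

Note that the tensor data itself is only borrowed, so `inputData` must remain valid for as long as the tensor map uses it.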
### `M_getTensorByNameFrom()`
> [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*M\_getTensorByNameFrom([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensorMap, const char \*name, [M\_Status](types.md#_CPPv48M_Status) \*status)
Gets a tensor from the tensor map by name.
* **Parameters:**
* tensorMap – The tensor map.
* name – The name of the tensor.
* status – The status object for reporting errors.
* **Returns:**
A pointer to the tensor. You are responsible for the memory associated with the pointer returned. The memory can be deallocated by calling [`M_freeTensor()`](#tensor_8h_1a339008df4a10af5e8c01ae970598765c). The held tensor inside the return value is simply borrowed from the corresponding input `M_AsyncTensorMap`. If the tensor map or name are invalid, a `NULL` pointer is returned and the `status` parameter contains an error message.
### `M_getTensorNumElements()`
> size\_t M\_getTensorNumElements(const [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor)
Gets the number of elements for the tensor.
* **Parameters:**
tensor – The tensor which must not be `NULL`.
* **Returns:**
The number of elements for the given tensor.
### `M_getTensorType()`
> [M\_Dtype](types.md#_CPPv47M_Dtype) M\_getTensorType(const [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor)
Gets the corresponding `M_Dtype` for the tensor.
* **Parameters:**
tensor – The tensor which must not be `NULL`.
* **Returns:**
The corresponding `M_Dtype` for the tensor.
### `M_getTensorData()`
> const void \*M\_getTensorData(const [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor)
Gets a pointer to the underlying data of the tensor.
* **Parameters:**
tensor – The tensor which must not be `NULL`.
* **Returns:**
A pointer to the underlying data of the tensor. This pointer is valid for the lifetime of the underlying tensor.
### `M_getTensorSpec()`
> [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*M\_getTensorSpec(const [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor)
Gets the tensor spec for the tensor.
* **Parameters:**
tensor – The tensor.
* **Returns:**
The tensor spec for the tensor if the tensor is valid. Otherwise, `NULL`.
### `M_getDeviceTypeFromSpec()`
> [M\_DeviceType](types.md#_CPPv412M_DeviceType) M\_getDeviceTypeFromSpec(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)
Gets the device type from a tensor specification.
* **Parameters:**
spec – The tensor spec.
* **Returns:**
The device type (CPU or GPU).
### `M_getDeviceIdFromSpec()`
> int M\_getDeviceIdFromSpec(const [M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)
Gets the device ID from a tensor specification.
* **Parameters:**
spec – The tensor spec.
* **Returns:**
The device ID. Returns `0` if the spec is invalid.
### `M_getTensorDevice()`
> [M\_Device](types.md#_CPPv48M_Device) \*M\_getTensorDevice(const [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor)
Gets the device on which a tensor resides.
* **Parameters:**
tensor – The tensor.
* **Returns:**
The device on which the tensor resides, or `NULL` if the tensor is invalid. The caller owns the returned device and must free it with `M_freeDevice()`.
### `M_copyTensorToDevice()`
> [M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*M\_copyTensorToDevice([M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor, [M\_Device](types.md#_CPPv48M_Device) \*device, [M\_Status](types.md#_CPPv48M_Status) \*status)
Copies a tensor to a different device.
Creates a copy of the tensor on the specified device.
* **Parameters:**
* tensor – The tensor to copy.
* device – The target device.
* status – The status object for reporting errors.
* **Returns:**
A pointer to the tensor on the target device. The caller owns the returned memory and must deallocate it by calling [`M_freeTensor()`](#tensor_8h_1a339008df4a10af5e8c01ae970598765c). Returns `NULL` if the operation fails, with an error message in the status.
### `M_freeTensor()`
> void M\_freeTensor([M\_AsyncTensor](types.md#_CPPv413M_AsyncTensor) \*tensor)
Deallocates the memory for the tensor. No-op if `tensor` is `NULL`.
* **Parameters:**
tensor – The tensor to deallocate.
### `M_freeTensorNameArray()`
> void M\_freeTensorNameArray([M\_TensorNameArray](types.md#_CPPv417M_TensorNameArray) \*names)
Deallocates the memory for the array of tensor names. No-op if `names` is `NULL`.
* **Parameters:**
names – The tensor names to deallocate.
### `M_freeTensorSpec()`
> void M\_freeTensorSpec([M\_TensorSpec](types.md#_CPPv412M_TensorSpec) \*spec)
Deallocates the memory for the tensor spec. No-op if `spec` is `NULL`.
* **Parameters:**
spec – The tensor spec to deallocate.
### `M_freeAsyncTensorMap()`
> void M\_freeAsyncTensorMap([M\_AsyncTensorMap](types.md#_CPPv416M_AsyncTensorMap) \*tensorMap)
Deallocates the memory for the tensor map. No-op if `tensorMap` is `NULL`.
* **Parameters:**
tensorMap – The tensor map to deallocate.
---
## Types
```c
#include "max/c/types.h"
```
**Typedefs:**
### `M_Status`
> typedef struct [M\_Status](#_CPPv48M_Status) M\_Status
Contains the success or failure of an API call.
In general, any API that may fail accepts a `M_Status` argument that is filled in with a meaningful error message on failure.
You can create this with [`M_newStatus()`](common.md#common_8h_1adb1ef3fc2e0bcdc8eb17cac3ce91835b). When you’re done, call [`M_freeStatus()`](common.md#common_8h_1ab5067fd51a5696b3679f7f629d3329c4).
### `M_RuntimeConfig`
> typedef struct [M\_RuntimeConfig](#_CPPv415M_RuntimeConfig) M\_RuntimeConfig
Specifies the MAX Engine configuration.
Configuration properties include the number of threads, artifact path, etc.
You can create this with [`M_newRuntimeConfig()`](context.md#context_8h_1a963f1d4eefd812ba8691acf516007cfc). When you’re done, call [`M_freeRuntimeConfig()`](context.md#context_8h_1a47f7e22f7f71da9ab5fb3a1886911610).
### `M_RuntimeContext`
> typedef struct [M\_RuntimeContext](#_CPPv416M_RuntimeContext) M\_RuntimeContext
Contains information that needs to be shared between APIs.
You can create this with [`M_newRuntimeContext()`](context.md#context_8h_1a46a6c670f73e1ce560f3c2cc1de93175). When you’re done, call [`M_freeRuntimeContext()`](context.md#context_8h_1a2434a11d8d65890c66f6b5516243a730).
### `M_CompileConfig`
> typedef struct [M\_CompileConfig](#_CPPv415M_CompileConfig) M\_CompileConfig
Specifies the configuration required for model compilation.
You can create this with [`M_newCompileConfig()`](model.md#model_8h_1a417e7a581c096ca26c36a1875163b665). When you’re done, call [`M_freeCompileConfig()`](model.md#model_8h_1abbf74b13adaf5bc8a0bb4d46c40688d9).
### `M_AsyncCompiledModel`
> typedef struct [M\_AsyncCompiledModel](#_CPPv420M_AsyncCompiledModel) M\_AsyncCompiledModel
Contains an async value to a compiled model.
`M_AsyncCompiledModel` can be passed to other APIs that accept compiled models as a function parameter. This async value will eventually resolve to a compiled model or an error in the case of compilation failure.
You can create this with [`M_compileModel()`](model.md#model_8h_1a88afca26a64b945885e1e1a0d09b5750). When you’re done, call [`M_freeCompiledModel()`](model.md#model_8h_1a5b6846eb4d47d445eb65c305b1c81b1c).
### `M_AsyncModel`
> typedef struct [M\_AsyncModel](#_CPPv412M_AsyncModel) M\_AsyncModel
Contains a future used for inference.
The future will resolve to a model that’s ready for inference.
You can create this with [`M_initModel()`](model.md#model_8h_1a2dcb9570ae117602579182d8faed494a). When you’re done, call [`M_freeModel()`](model.md#model_8h_1a4094fa8e414f8b6a6563474f8840d33c).
### `M_AsyncTensor`
> typedef struct [M\_AsyncTensor](#_CPPv413M_AsyncTensor) M\_AsyncTensor
Contains an async value to a tensor for inference.
You can get this from [`M_getTensorByNameFrom()`](tensor.md#tensor_8h_1a9522ad955454dbd2d044066dea2cad95). When you’re done, call [`M_freeTensor()`](tensor.md#tensor_8h_1a339008df4a10af5e8c01ae970598765c).
### `M_TensorNameArray`
> typedef struct [M\_TensorNameArray](#_CPPv417M_TensorNameArray) M\_TensorNameArray
Contains an array of tensor names of model inputs or outputs.
You can get this from `M_getInputNames()` and `M_getOutputNames()`. When you’re done, call [`M_freeTensorNameArray()`](tensor.md#tensor_8h_1a7fa5d2aff7f89143ae1905fc29b5b112).
### `M_TensorSpec`
> typedef struct [M\_TensorSpec](#_CPPv412M_TensorSpec) M\_TensorSpec
Contains the representation of a shape and an element type.
You can create this with [`M_newTensorSpec()`](tensor.md#tensor_8h_1ab7546d4d0a22ae82134d200272e8f8f4). When you’re done, call [`M_freeTensorSpec()`](tensor.md#tensor_8h_1af0b957daeba1760134c3f24079b53026).
### `M_AsyncTensorMap`
> typedef struct [M\_AsyncTensorMap](#_CPPv416M_AsyncTensorMap) M\_AsyncTensorMap
Contains a collection of tensors.
The collection of tensors is used to represent inputs and outputs when executing a model.
You can create this with [`M_newAsyncTensorMap()`](tensor.md#tensor_8h_1a18039c6e6c1769b947120b27178306eb). When you’re done, call [`M_freeAsyncTensorMap()`](tensor.md#tensor_8h_1a0ac9628dcba39c9977b7f7ff95d8781e).
### `M_WeightsRegistry`
> typedef struct [M\_WeightsRegistry](#_CPPv417M_WeightsRegistry) M\_WeightsRegistry
Maps unique weight names to their backing data.
### `M_Device`
> typedef struct [M\_Device](#_CPPv48M_Device) M\_Device
Contains a device handle.
A device represents a computational unit (CPU or GPU) that can execute operations and hold tensors.
You can create this with `M_newDevice()`. When you’re done, call `M_freeDevice()`.
**Enums:**
### `M_Dtype`
> enum M\_Dtype
Represents all data types supported by the framework.
Values:
#### `M_UNKNOWN`
> enumerator M\_UNKNOWN
#### `mIsInteger`
> enumerator mIsInteger
#### `mIsFloat`
> enumerator mIsFloat
#### `mIsComplex`
> enumerator mIsComplex
#### `mIsSigned`
> enumerator mIsSigned
Bit 0 encodes “isSigned”.
#### `kIntWidthShift`
> enumerator kIntWidthShift
#### `M_INT1`
> enumerator M\_INT1
#### `M_UINT1`
> enumerator M\_UINT1
#### `M_INT2`
> enumerator M\_INT2
#### `M_UINT2`
> enumerator M\_UINT2
#### `M_INT4`
> enumerator M\_INT4
#### `M_UINT4`
> enumerator M\_UINT4
#### `M_INT8`
> enumerator M\_INT8
#### `M_UINT8`
> enumerator M\_UINT8
#### `M_INT16`
> enumerator M\_INT16
#### `M_UINT16`
> enumerator M\_UINT16
#### `M_INT32`
> enumerator M\_INT32
#### `M_UINT32`
> enumerator M\_UINT32
#### `M_INT64`
> enumerator M\_INT64
#### `M_UINT64`
> enumerator M\_UINT64
#### `M_INT128`
> enumerator M\_INT128
#### `M_UINT128`
> enumerator M\_UINT128
#### `M_FLOAT4_E2M1FN`
> enumerator M\_FLOAT4\_E2M1FN
Bits 0 through 3 indicate the kind of FP value.
#### `M_FLOAT8_E8M0FNU`
> enumerator M\_FLOAT8\_E8M0FNU
Some slots are left blank here so that more low-precision types can be supported in the future.
#### `M_FLOAT8_E3M4`
> enumerator M\_FLOAT8\_E3M4
#### `M_FLOAT8_E4M3FN`
> enumerator M\_FLOAT8\_E4M3FN
#### `M_FLOAT8_E4M3FNUZ`
> enumerator M\_FLOAT8\_E4M3FNUZ
#### `M_FLOAT8_E5M2`
> enumerator M\_FLOAT8\_E5M2
#### `M_FLOAT8_E5M2FNUZ`
> enumerator M\_FLOAT8\_E5M2FNUZ
#### `M_FLOAT16`
> enumerator M\_FLOAT16
#### `M_BFLOAT16`
> enumerator M\_BFLOAT16
#### `M_FLOAT32`
> enumerator M\_FLOAT32
#### `M_FLOAT64`
> enumerator M\_FLOAT64
#### `M_BOOL`
> enumerator M\_BOOL
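The helper enumerators above (`mIsSigned`, `kIntWidthShift`, and the bit-range notes) suggest that integer dtype tags pack signedness and width into bit fields. The Python sketch below is purely illustrative of that technique; the shift value and bit layout are assumptions, not the actual MAX encoding:

```python
# Illustrative bit-packed dtype tags, NOT the actual MAX layout.
# Assumption: bit 0 encodes signedness; log2(width) sits in higher bits.
M_IS_SIGNED = 1 << 0       # bit 0 encodes "isSigned"
K_INT_WIDTH_SHIFT = 1      # hypothetical shift for the width field

def make_int_dtype(width_log2: int, signed: bool) -> int:
    """Pack an integer dtype tag from log2(bit width) and signedness."""
    return (width_log2 << K_INT_WIDTH_SHIFT) | (M_IS_SIGNED if signed else 0)

def is_signed(tag: int) -> bool:
    return bool(tag & M_IS_SIGNED)

def width_bits(tag: int) -> int:
    return 1 << (tag >> K_INT_WIDTH_SHIFT)

int32_tag = make_int_dtype(5, signed=True)  # 2**5 = 32-bit signed integer
assert is_signed(int32_tag) and width_bits(int32_tag) == 32
```

Packing metadata into the enum value this way lets predicates such as signedness checks run as a single bitwise test.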
### `M_AllocatorType`
> enum M\_AllocatorType
Contains an `AllocatorType`. You can choose between `kCaching` and `kSystem`: `kCaching` trades higher memory usage for better performance, while `kSystem` uses the default system allocator.
Values:
#### `kSystem`
> enumerator kSystem
#### `kCaching`
> enumerator kCaching
### `M_ValueType`
> enum M\_ValueType
Represents the type of a value.
Values:
#### `M_STRING_VALUE`
> enumerator M\_STRING\_VALUE
#### `M_DOUBLE_VALUE`
> enumerator M\_DOUBLE\_VALUE
#### `M_LONG_VALUE`
> enumerator M\_LONG\_VALUE
#### `M_BOOL_VALUE`
> enumerator M\_BOOL\_VALUE
#### `M_TENSOR_VALUE`
> enumerator M\_TENSOR\_VALUE
#### `M_LIST_VALUE`
> enumerator M\_LIST\_VALUE
#### `M_TUPLE_VALUE`
> enumerator M\_TUPLE\_VALUE
#### `M_DICT_VALUE`
> enumerator M\_DICT\_VALUE
#### `M_NONE_VALUE`
> enumerator M\_NONE\_VALUE
#### `M_UNKNOWN_VALUE`
> enumerator M\_UNKNOWN\_VALUE
#### `M_MOJO_VALUE`
> enumerator M\_MOJO\_VALUE
#### `M_PYTHON_MOJO_VALUE`
> enumerator M\_PYTHON\_MOJO\_VALUE
### `M_DeviceType`
> enum M\_DeviceType
Represents the type of device.
Values:
#### `M_HOST`
> enumerator M\_HOST
#### `M_ACCELERATOR`
> enumerator M\_ACCELERATOR
### `M_ResultOutputStyle`
> enum M\_ResultOutputStyle
Represents the result output style for debug printing.
Values:
#### `M_COMPACT`
> enumerator M\_COMPACT
#### `M_FULL`
> enumerator M\_FULL
#### `M_BINARY`
> enumerator M\_BINARY
#### `M_BINARY_MAX_CHECKPOINT`
> enumerator M\_BINARY\_MAX\_CHECKPOINT
#### `M_NONE`
> enumerator M\_NONE
---
## API references
import ListingCards from '@site/src/components/Listing/ListingCards';
export const cards = [
{
title: 'Python',
url: '/max/api/python',
description: 'The Python library API reference.'
},
{
title: 'Mojo',
url: '/mojo/lib',
description: 'The Mojo library API reference.'
},
{
title: 'REST',
url: '/max/api/serve',
description: 'The MAX serving REST API reference.'
}
]
---
## BackgroundRecorder
## `BackgroundRecorder` {#max.diagnostics.gpu.BackgroundRecorder}
> class max.diagnostics.gpu.BackgroundRecorder
Asynchronous GPU metrics collection and data export capabilities.
The `BackgroundRecorder` enables continuous monitoring of GPU performance metrics
without blocking the main application thread. It automatically samples GPU
statistics at one-second intervals in a separate process, making it ideal for
profiling long-running inference sessions or training workloads.
When used as a context manager, the recorder starts background collection upon
entry and stops collection upon exit. The collected statistics are then
available through the stats property as a time-series of GPU measurements.
```python
from max.diagnostics.gpu import BackgroundRecorder
with BackgroundRecorder() as recorder:
    # Run your GPU workload here
    run_inference_session()

for i, snapshot in enumerate(recorder.stats):
    print(f"Sample {i}: {len(snapshot)} GPUs detected")
    for gpu_id, gpu_stats in snapshot.items():
        print(f"  {gpu_id}: {gpu_stats.memory.used_bytes} bytes used")
```
### `stats` {#max.diagnostics.gpu.BackgroundRecorder.stats}
> property stats: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [GPUStats](GPUStats.md#max.diagnostics.gpu.GPUStats)]]
Time-series of GPU statistics collected during background recording.
**Returns:**
A list of dictionaries, where each dictionary represents GPU statistics
at a specific point in time. Each dictionary maps GPU identifiers to
their corresponding [`GPUStats`](GPUStats.md#max.diagnostics.gpu.GPUStats) objects.
**Raises:**
[RuntimeError](https://docs.python.org/3/library/exceptions.html#RuntimeError) – If accessed before the recorder context has exited.
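Since `stats` is a plain list of dictionaries, ordinary Python is enough to post-process a recording. A hedged sketch that reduces a time-series to the peak memory usage per GPU (`peak_used_bytes` is a hypothetical helper, not part of the API; the stand-in objects below only mimic the `GPUStats`/`MemoryStats` attribute shapes):

```python
from types import SimpleNamespace as NS

def peak_used_bytes(stats):
    """Reduce [{gpu_id: stats}, ...] snapshots to peak used bytes per GPU."""
    peaks = {}
    for snapshot in stats:
        for gpu_id, gpu_stats in snapshot.items():
            used = gpu_stats.memory.used_bytes
            peaks[gpu_id] = max(used, peaks.get(gpu_id, 0))
    return peaks

# Stand-in snapshots (no GPU required) mirroring the documented shapes.
series = [
    {"nv0": NS(memory=NS(used_bytes=100))},
    {"nv0": NS(memory=NS(used_bytes=300))},
    {"nv0": NS(memory=NS(used_bytes=200))},
]
assert peak_used_bytes(series) == {"nv0": 300}
```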
---
## GPUDiagContext
## `GPUDiagContext` {#max.diagnostics.gpu.GPUDiagContext}
> class max.diagnostics.gpu.GPUDiagContext
Context manager providing unified access to GPU diagnostic information across NVIDIA and AMD hardware.
This class automatically detects and initializes supported GPU vendor libraries
(NVML for NVIDIA, ROCm SMI for AMD) and provides a unified interface for
collecting diagnostic statistics from all available GPUs in the system.
```python
from max.diagnostics.gpu import GPUDiagContext
with GPUDiagContext() as ctx:
    stats = ctx.get_stats()
    for gpu_id, gpu_stats in stats.items():
        print(f"GPU {gpu_id}: {gpu_stats.memory.used_bytes} bytes used")
```
### `get_stats()` {#max.diagnostics.gpu.GPUDiagContext.get_stats}
> get\_stats()
Retrieve current GPU statistics for all detected GPUs in the system.
**Returns:**
A dictionary mapping GPU identifiers to their current statistics.
NVIDIA GPUs are prefixed with `nv` (e.g., `nv0`, `nv1`) and AMD
GPUs are prefixed with `amd` (e.g., `amd0`, `amd1`).
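Because the vendor is encoded in the identifier prefix, results can be grouped with plain string handling. A minimal sketch assuming only the prefix convention documented above (`vendor_of` is an illustrative helper, not part of the API):

```python
def vendor_of(gpu_id: str) -> str:
    """Map an identifier like 'nv0' or 'amd1' to its vendor name."""
    if gpu_id.startswith("nv"):
        return "nvidia"
    if gpu_id.startswith("amd"):
        return "amd"
    raise ValueError(f"unrecognized GPU identifier: {gpu_id}")

assert vendor_of("nv0") == "nvidia"
assert vendor_of("amd1") == "amd"
```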
---
## GPUStats
## `GPUStats` {#max.diagnostics.gpu.GPUStats}
> class max.diagnostics.gpu.GPUStats(memory, utilization)
Comprehensive GPU state snapshot containing memory and utilization metrics.
This class provides a complete view of a GPU’s current state, including
detailed memory usage statistics and utilization percentages. It serves
as the primary data structure returned by GPU diagnostic queries.
### `memory` {#max.diagnostics.gpu.GPUStats.memory}
> memory: [MemoryStats](MemoryStats.md#max.diagnostics.gpu.MemoryStats)
Current GPU memory usage statistics.
---
## MemoryStats
## `MemoryStats` {#max.diagnostics.gpu.MemoryStats}
> class max.diagnostics.gpu.MemoryStats(total\_bytes, free\_bytes, used\_bytes, reserved\_bytes)
Detailed GPU memory usage statistics including total, free, used, and reserved memory.
This class provides comprehensive memory information for a GPU, allowing
developers to monitor memory consumption and identify potential memory
bottlenecks during model inference or training.
### `free_bytes` {#max.diagnostics.gpu.MemoryStats.free_bytes}
> free\_bytes: [int](https://docs.python.org/3/library/functions.html#int)
Currently available (free) GPU memory in bytes.
### `total_bytes` {#max.diagnostics.gpu.MemoryStats.total_bytes}
> total\_bytes: [int](https://docs.python.org/3/library/functions.html#int)
Total GPU memory in bytes.
### `used_bytes` {#max.diagnostics.gpu.MemoryStats.used_bytes}
> used\_bytes: [int](https://docs.python.org/3/library/functions.html#int)
Currently allocated (used) GPU memory in bytes.
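These raw byte counts compose naturally into derived metrics. For instance, a utilization percentage can be computed from `used_bytes` and `total_bytes`; the helper below is an illustrative sketch, not part of the API:

```python
def memory_utilization_percent(total_bytes: int, used_bytes: int) -> float:
    """Percentage of GPU memory currently in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes

# Example with made-up values: 6 GiB used on a 24 GiB device.
GIB = 1024 ** 3
assert memory_utilization_percent(24 * GIB, 6 * GIB) == 25.0
```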
---
## UtilizationStats
## `UtilizationStats` {#max.diagnostics.gpu.UtilizationStats}
> class max.diagnostics.gpu.UtilizationStats(gpu\_usage\_percent, memory\_activity\_percent)
GPU compute and memory activity utilization percentages.
This class captures the current utilization levels of a GPU’s compute
units and memory subsystem, providing insights into how effectively
the GPU resources are being utilized during workload execution.
### `gpu_usage_percent` {#max.diagnostics.gpu.UtilizationStats.gpu_usage_percent}
> gpu\_usage\_percent: [int](https://docs.python.org/3/library/functions.html#int)
Current GPU compute utilization as a percentage (0-100).
### `memory_activity_percent` {#max.diagnostics.gpu.UtilizationStats.memory_activity_percent}
> memory\_activity\_percent: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
Memory controller activity percentage, if available from the GPU vendor.
---
## gpu
Real-time GPU monitoring and diagnostic capabilities for NVIDIA and AMD graphics
hardware.
The GPU diagnostics module provides comprehensive tools for monitoring graphics
processing unit performance, memory usage, and utilization metrics. It supports
both NVIDIA GPUs through NVML and AMD GPUs through ROCm SMI, offering unified
access to hardware statistics regardless of vendor. The API enables both
synchronous queries for immediate metrics and asynchronous background collection
for continuous monitoring during long-running inference sessions.
## Classes
* [`BackgroundRecorder`](/max/api/python/diagnostics/gpu/BackgroundRecorder):
Asynchronous GPU metrics collection.
* [`GPUDiagContext`](/max/api/python/diagnostics/gpu/GPUDiagContext):
Context manager providing unified access to GPU diagnostic information across
NVIDIA and AMD hardware.
* [`GPUStats`](/max/api/python/diagnostics/gpu/GPUStats): Comprehensive
GPU state snapshot containing memory and utilization statistics.
* [`MemoryStats`](/max/api/python/diagnostics/gpu/MemoryStats): Detailed
GPU memory usage statistics including total, free, used, and reserved memory.
* [`UtilizationStats`](/max/api/python/diagnostics/gpu/UtilizationStats):
GPU compute and memory activity utilization percentages.
---
## driver
Exposes APIs for interacting with hardware, such as allocating tensors on a GPU
and moving tensors between the CPU and GPU. It provides interfaces for memory
management, device properties, and hardware monitoring. Through these APIs, you
can control data placement, track resource utilization, and configure device
settings for optimal performance.
For example, the following code uses an accelerator if one is available, and
otherwise falls back to the CPU:
```python
from max import driver
device = driver.CPU() if driver.accelerator_count() == 0 else driver.Accelerator()
print(f"Using {device} device")
```
## `Accelerator` {#max.driver.Accelerator}
> class max.driver.Accelerator(\*args, \*\*kwargs)
## `Buffer` {#max.driver.Buffer}
> class max.driver.Buffer(\*args, \*\*kwargs)
Device-resident buffer representation.
Allocates memory onto a given device with the provided shape and dtype.
Buffers can be sliced to provide strided views of the underlying memory,
but any buffers input into model execution must be contiguous.
Supports numpy-style slicing but does not currently support setting
items across multiple indices.
```python
from max import driver
from max.dtype import DType
cpu_buffer = driver.Buffer(shape=[2, 3], dtype=DType.float32)
# Create a buffer on GPU
gpu = driver.Accelerator()
gpu_buffer = driver.Buffer(shape=[2, 3], dtype=DType.float32, device=gpu)
```
**Parameters:**
* dtype ([DType](dtype.md#max.dtype.DType)) – Data type of buffer elements.
* shape (Sequence\[[int](https://docs.python.org/3/library/functions.html#int)]) – Tuple of positive, non-zero integers denoting the buffer shape.
* device ([Device](#max.driver.Device), optional) – Device to allocate buffer onto. Defaults to the CPU.
* pinned ([bool](https://docs.python.org/3/library/functions.html#bool), optional) – If True, memory is page-locked (pinned). Defaults to False.
* stream ([DeviceStream](#max.driver.DeviceStream), optional) – Stream to associate the buffer with.
### `contiguous()` {#max.driver.Buffer.contiguous}
> contiguous()
Creates a contiguous copy of the parent buffer.
**Parameters:**
self ([Buffer](#max.driver.Buffer))
**Return type:**
[Buffer](#max.driver.Buffer)
### `copy()` {#max.driver.Buffer.copy}
> copy(self, stream: [max.driver.DeviceStream](#max.driver.DeviceStream)) → [max.driver.Buffer](#max.driver.Buffer)
> copy(self, device: [max.driver.Device](#max.driver.Device) | [None](https://docs.python.org/3/library/constants.html#None) = None) → [max.driver.Buffer](#max.driver.Buffer)
Overloaded function.
1. `copy(self, stream: max.driver.DeviceStream) -> max.driver.Buffer`
> Creates a deep copy on the device associated with the stream.
> Args:
> : stream (DeviceStream): The stream to associate the new buffer with.
> Returns:
> : Buffer: A new buffer that is a copy of this buffer.
2. `copy(self, device: max.driver.Device | None = None) -> max.driver.Buffer`
> Creates a deep copy on an optionally given device.
> If device is None (default), a copy is created on the same device.
>
> ```python
> from max import driver
> from max.dtype import DType
>
> cpu_buffer = driver.Buffer(shape=[2, 3], dtype=DType.bfloat16, device=driver.CPU())
> cpu_copy = cpu_buffer.copy()
>
> # Copy to GPU
> gpu = driver.Accelerator()
> gpu_copy = cpu_buffer.copy(device=gpu)
> ```
> Args:
> : device (Device, optional): The device to create the copy on.
> : Defaults to None (same device).
> Returns:
> : Buffer: A new buffer that is a copy of this buffer.
### `device` {#max.driver.Buffer.device}
> property device
Device on which the buffer is resident.
### `disable_auto_sync()` {#max.driver.Buffer.disable_auto_sync}
> disable\_auto\_sync(self) → [None](https://docs.python.org/3/library/constants.html#None)
Disables automatic synchronization for asynchronous operations on this buffer.
:::caution Caution
This is an experimental feature that may be unstable. It also
requires special care from the user to ensure proper synchronization.
:::
By default, certain operations on buffers cause synchronization, such as
accessing a buffer on the host through `to_numpy()`. However, the default
synchronization is quite conservative and often waits on more than is
strictly needed.
This function disables the default synchronization method and enables
mark\_as\_ready(), which allows for a finer control of what is waited on
when a buffer needs to be synchronized.
```python
# Assuming we have 3 buffers of the same size: a, b, and c
# Default case with auto-synchronization
a.to(b) # 1
a.to(c) # 2
# Will wait on 1 and 2
b.to_numpy()
# Disabled synchronization
a.disable_auto_sync()
a.to(b) # 1
a.to(c) # 2
# Doesn't wait on 1 or 2, data in b could be invalid
b.to_numpy()
# Disabled synchronization with mark_as_ready
a.disable_auto_sync()
a.to(b) # 1
b.mark_as_ready()
a.to(c) # 2
# Wait on 1 but not on 2
b.to_numpy()
```
### `dtype` {#max.driver.Buffer.dtype}
> property dtype
DType of the buffer's constituent elements.
### `element_size` {#max.driver.Buffer.element_size}
> property element\_size
Return the size of the element type in bytes.
### `from_dlpack()` {#max.driver.Buffer.from_dlpack}
> from\_dlpack(\*, copy=None)
Create a buffer from an object implementing the dlpack protocol.
This usually does not result in a copy, and the producer of the object
retains ownership of the underlying memory.
### `from_numpy()` {#max.driver.Buffer.from_numpy}
> from\_numpy()
Creates a buffer from a provided numpy array on the host device.
The underlying data is not copied unless the array is non-contiguous, in
which case a contiguous copy is returned.
### `inplace_copy_from()` {#max.driver.Buffer.inplace_copy_from}
> inplace\_copy\_from(src)
Copy the contents of another buffer into this one.
These buffers may be on different devices.
Requires that both buffers are contiguous and have the same size.
### `is_contiguous` {#max.driver.Buffer.is_contiguous}
> property is\_contiguous
Whether or not buffer is contiguously allocated in memory. Returns
false if the buffer is a non-contiguous slice.
Currently, certain layouts that are technically contiguous are treated as
non-contiguous for the purposes of the engine, such as buffers with
negative steps.
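NumPy applies a comparable rule: a negative-step slice is a valid view but is not C-contiguous. The sketch below (NumPy, not MAX code) illustrates the distinction this property is drawing:

```python
import numpy as np

a = np.arange(6, dtype=np.float32).reshape(2, 3)
rev = a[:, ::-1]  # negative-step view of the same memory

assert a.flags["C_CONTIGUOUS"]        # original layout is contiguous
assert not rev.flags["C_CONTIGUOUS"]  # reversed view is not
```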
### `is_host` {#max.driver.Buffer.is_host}
> property is\_host
Whether or not buffer is host-resident. Returns false for GPU buffers,
true for CPU buffers.
```python
from max import driver
from max.dtype import DType
cpu_buffer = driver.Buffer(shape=[2, 3], dtype=DType.bfloat16, device=driver.CPU())
print(cpu_buffer.is_host)
```
### `item()` {#max.driver.Buffer.item}
> item(self) → [Any](https://docs.python.org/3/library/typing.html#typing.Any)
Returns the scalar value at a given location. Currently
implemented only for zero-rank buffers. The return type is
converted to a Python built-in type.
### `mark_as_ready()` {#max.driver.Buffer.mark_as_ready}
> mark\_as\_ready(self) → [None](https://docs.python.org/3/library/constants.html#None)
Establishes a synchronization point for buffers with disabled auto-sync.
:::caution Caution
This is an experimental feature that may be unstable. It also
requires special care from the user to ensure proper synchronization.
:::
This method can only be called on buffers with disabled synchronization
through disable\_auto\_sync().
It instructs MAX that, whenever it needs to wait on this buffer, it
should only wait up to the point where this method was called.
It can be called multiple times, but it will override a previous
synchronization point with the new one.
Refer to the disable\_auto\_sync() documentation for more details and examples.
### `mmap()` {#max.driver.Buffer.mmap}
> mmap(dtype, shape, mode='copyonwrite', offset=0)
### `num_elements` {#max.driver.Buffer.num_elements}
> property num\_elements
Returns the number of elements in this buffer.
Rank-0 buffers have 1 element by convention.
### `pinned` {#max.driver.Buffer.pinned}
> property pinned
Whether or not the underlying memory is pinned (page-locked).
### `rank` {#max.driver.Buffer.rank}
> property rank
Buffer rank.
### `scalar` {#max.driver.Buffer.scalar}
> scalar
### `shape` {#max.driver.Buffer.shape}
> property shape
Shape of buffer.
### `stream` {#max.driver.Buffer.stream}
> property stream
Stream to which the buffer is bound.
### `to()` {#max.driver.Buffer.to}
> to(self, device: [max.driver.Device](#max.driver.Device)) → [max.driver.Buffer](#max.driver.Buffer)
> to(self, stream: [max.driver.DeviceStream](#max.driver.DeviceStream)) → [max.driver.Buffer](#max.driver.Buffer)
> to(self, devices: [collections.abc.Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[max.driver.Device](#max.driver.Device)]) → [list](https://docs.python.org/3/library/stdtypes.html#list)\[[max.driver.Buffer](#max.driver.Buffer)]
> to(self, streams: [collections.abc.Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[max.driver.DeviceStream](#max.driver.DeviceStream)]) → [list](https://docs.python.org/3/library/stdtypes.html#list)\[[max.driver.Buffer](#max.driver.Buffer)]
Overloaded function.
1. `to(self, device: max.driver.Device) -> max.driver.Buffer`
> Return a buffer that’s guaranteed to be on the given device.
> The buffer is only copied if the requested device is different from the
> device upon which the buffer is already resident.
2. `to(self, stream: max.driver.DeviceStream) -> max.driver.Buffer`
> Return a buffer that’s guaranteed to be on the given device and associated
> with the given stream.
> The buffer is only copied if the requested device is different from the
> device upon which the buffer is already resident. If the destination
> stream is on the same device, then a new reference to the same buffer is
> returned.
3. `to(self, devices: collections.abc.Sequence[max.driver.Device]) -> list[max.driver.Buffer]`
> Return a list of buffers that are guaranteed to be on the given devices.
> The buffers are only copied if the requested devices are different from the
> device upon which the buffer is already resident.
4. `to(self, streams: collections.abc.Sequence[max.driver.DeviceStream]) -> list[max.driver.Buffer]`
> Return a list of buffers that are guaranteed to be on the given streams.
> The buffers are only copied if the requested streams are different from the
> stream upon which the buffer is already resident.
### `to_numpy()` {#max.driver.Buffer.to_numpy}
> to\_numpy()
Converts the buffer to a numpy array.
If the buffer is not on the host, a copy will be issued.
### `view()` {#max.driver.Buffer.view}
> view(dtype, shape=None)
Return a new buffer with the given type and shape that shares the underlying memory.
If the shape is not given, it will be deduced if possible, or a
ValueError is raised.
### `zeros` {#max.driver.Buffer.zeros}
> zeros
## `CPU` {#max.driver.CPU}
> class max.driver.CPU(\*args, \*\*kwargs)
## `DLPackArray` {#max.driver.DLPackArray}
> class max.driver.DLPackArray(\*args, \*\*kwargs)
## `Device` {#max.driver.Device}
> class max.driver.Device
### `api` {#max.driver.Device.api}
> property api
Returns the API used to program the device.
Possible values are:
* `cpu` for host devices.
* `cuda` for NVIDIA GPUs.
* `hip` for AMD GPUs.
```python
from max import driver
device = driver.CPU()
device.api
```
### `architecture_name` {#max.driver.Device.architecture_name}
> property architecture\_name
Returns the architecture name of the device.
Examples of possible values:
* `gfx90a`, `gfx942` for AMD GPUs.
* `sm_80`, `sm_86` for NVIDIA GPUs.
* CPU devices raise an exception.
```python
from max import driver
device = driver.Accelerator()
device.architecture_name
```
### `can_access()` {#max.driver.Device.can_access}
> can\_access(self, other: [max.driver.Device](#max.driver.Device)) → [bool](https://docs.python.org/3/library/functions.html#bool)
Checks if this device can directly access memory of another device.
```python
from max import driver
gpu0 = driver.Accelerator(id=0)
gpu1 = driver.Accelerator(id=1)
if gpu0.can_access(gpu1):
    print("GPU0 can directly access GPU1 memory.")
```
**Parameters:**
other ([Device](#max.driver.Device)) – The other device to check peer access against.
### `cpu` {#max.driver.Device.cpu}
> cpu
### `default_stream` {#max.driver.Device.default_stream}
> property default\_stream
Returns the default stream for this device.
The default stream is initialized when the device object is created.
**Returns:**
The default execution stream for this device.
**Return type:**
[DeviceStream](#max.driver.DeviceStream)
### `id` {#max.driver.Device.id}
> property id
Returns a zero-based device id. For a CPU device this is always 0.
For GPU accelerators this is the id of the device relative to this host.
Along with the `label`, an id can uniquely identify a device,
e.g. `gpu:0`, `gpu:1`.
```python
from max import driver
device = driver.Accelerator()
device_id = device.id
```
### `is_host` {#max.driver.Device.is_host}
> property is\_host
Whether this device is the CPU (host) device.
```python
from max import driver
device = driver.CPU()
device.is_host
```
### `label` {#max.driver.Device.label}
> property label
Returns device label.
Possible values are:
* `cpu` for host devices.
* `gpu` for accelerators.
```python
from max import driver
device = driver.CPU()
device.label
```
### `stats` {#max.driver.Device.stats}
> property stats
Returns utilization data for the device.
```python
from max import driver
device = driver.CPU()
stats = device.stats
```
**Returns:**
A dictionary containing device utilization statistics.
### `synchronize()` {#max.driver.Device.synchronize}
> synchronize(self) → [None](https://docs.python.org/3/library/constants.html#None)
Ensures all operations on this device complete before returning.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If any enqueued operations had an internal error.
## `DeviceSpec` {#max.driver.DeviceSpec}
> class max.driver.DeviceSpec(id, device\_type='cpu')
Specification for a device, containing its ID and type.
This class provides a way to specify device parameters like ID and type (CPU/GPU)
for creating Device instances.
**Parameters:**
* id ([int](https://docs.python.org/3/library/functions.html#int))
* device\_type ([Literal](https://docs.python.org/3/library/typing.html#typing.Literal)\['cpu', 'gpu'])
### `cpu()` {#max.driver.DeviceSpec.cpu}
> static cpu(id=-1)
Creates a CPU device specification.
**Parameters:**
id ([int](https://docs.python.org/3/library/functions.html#int))
### `device_type` {#max.driver.DeviceSpec.device_type}
> device\_type: [Literal](https://docs.python.org/3/library/typing.html#typing.Literal)\['cpu', 'gpu'] = 'cpu'
Type of specified device.
### `id` {#max.driver.DeviceSpec.id}
> id: [int](https://docs.python.org/3/library/functions.html#int)
Provided id for this device.
## `DeviceStream` {#max.driver.DeviceStream}
> class max.driver.DeviceStream(\*args, \*\*kwargs)
Provides access to a stream of execution on a device.
A stream represents a sequence of operations that will be executed in order.
Multiple streams on the same device can execute concurrently.
```python
from max import driver
# Create a default accelerator device
device = driver.Accelerator()
# Get the default stream for the device
stream = device.default_stream
# Create a new stream of execution on the device
new_stream = driver.DeviceStream(device)
```
### `device` {#max.driver.DeviceStream.device}
> property device
The device this stream is executing on.
### `synchronize()` {#max.driver.DeviceStream.synchronize}
> synchronize(self) → [None](https://docs.python.org/3/library/constants.html#None)
Ensures all operations on this stream complete before returning.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If any enqueued operations had an internal error.
### `wait_for()` {#max.driver.DeviceStream.wait_for}
> wait\_for(self, stream: [max.driver.DeviceStream](#max.driver.DeviceStream)) → [None](https://docs.python.org/3/library/constants.html#None)
> wait\_for(self, device: [max.driver.Device](#max.driver.Device)) → [None](https://docs.python.org/3/library/constants.html#None)
Overloaded function.
1. `wait_for(self, stream: max.driver.DeviceStream) -> None`
> Ensures all operations on the other stream complete before future work
> submitted to this stream is scheduled.
> Args:
> : stream (DeviceStream): The stream to wait for.
2. `wait_for(self, device: max.driver.Device) -> None`
> Ensures all operations on device’s default stream complete before
> future work submitted to this stream is scheduled.
> Args:
> : device (Device): The device whose default stream to wait for.
## `accelerator_api()` {#max.driver.accelerator_api}
> max.driver.accelerator\_api()
Returns the API used to program the accelerator.
## `accelerator_architecture_name()` {#max.driver.accelerator_architecture_name}
> max.driver.accelerator\_architecture\_name()
Returns the architecture name of the accelerator device.
## `calculate_virtual_device_count()` {#max.driver.calculate_virtual_device_count}
> max.driver.calculate\_virtual\_device\_count(\*device\_spec\_lists)
Calculate the minimum virtual device count needed for the given device specs.
**Parameters:**
\*device\_spec\_lists ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[DeviceSpec](#max.driver.DeviceSpec)]) – One or more lists of DeviceSpec objects (e.g., main devices
and draft devices)
**Returns:**
The minimum number of virtual devices needed (max GPU ID + 1), or 1 if no GPUs
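The documented rule (max GPU id + 1, or 1 when no GPUs are present) is easy to state in code. The sketch below re-implements it for illustration only, with `(id, device_type)` tuples standing in for `DeviceSpec` objects:

```python
def virtual_device_count(*device_spec_lists):
    """Illustrative rule: max GPU id + 1, or 1 when no GPU specs exist."""
    gpu_ids = [
        spec_id
        for specs in device_spec_lists
        for spec_id, device_type in specs
        if device_type == "gpu"
    ]
    return max(gpu_ids) + 1 if gpu_ids else 1

# Main devices gpu:0 and gpu:2 plus a draft device gpu:1 need 3 virtual devices.
assert virtual_device_count([(0, "gpu"), (2, "gpu")], [(1, "gpu")]) == 3
assert virtual_device_count([(0, "cpu")]) == 1
```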
## `calculate_virtual_device_count_from_cli()` {#max.driver.calculate_virtual_device_count_from_cli}
> max.driver.calculate\_virtual\_device\_count\_from\_cli(\*device\_inputs)
Calculate virtual device count from raw CLI inputs (before parsing).
This helper works with the raw device input strings or lists before they’re
parsed into DeviceSpec objects. Used when virtual device mode needs to be
enabled before device validation occurs.
**Parameters:**
\*device\_inputs ([str](https://docs.python.org/3/library/stdtypes.html#str) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]) – One or more raw device inputs: either strings like `gpu:0,1,2`
or lists of integers like `[0, 1, 2]`
**Returns:**
The minimum number of virtual devices needed (max GPU ID + 1), or 1 if no GPUs
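The raw input formats mentioned above can be normalized with a small parser. This sketch only reflects the two documented shapes (`gpu:0,1,2` strings and integer lists) and is not the actual implementation:

```python
def parse_device_ids(device_input):
    """Extract GPU ids from 'gpu:0,1,2'-style strings or lists of ints."""
    if isinstance(device_input, str):
        kind, _, ids = device_input.partition(":")
        if kind != "gpu" or not ids:
            return []  # non-GPU inputs contribute no GPU ids
        return [int(i) for i in ids.split(",")]
    return list(device_input)

assert parse_device_ids("gpu:0,1,2") == [0, 1, 2]
assert parse_device_ids([0, 1]) == [0, 1]
assert parse_device_ids("cpu") == []
```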
## `load_devices()` {#max.driver.load_devices}
> max.driver.load\_devices(device\_specs)
Initialize and return a list of devices, given a list of device specs.
## `load_max_buffer()` {#max.driver.load_max_buffer}
> max.driver.load\_max\_buffer(path)
Experimental method for loading serialized MAX buffers.
MAX buffers can be exported by creating a graph and calling Value.print()
with the BINARY\_MAX\_CHECKPOINT option.
**Parameters:**
path ([PathLike](https://docs.python.org/3/library/os.html#os.PathLike)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]) – Path to buffer (should end with .max)
**Returns:**
A Buffer created from the path. The shape and dtype are read
from the file.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the file format is not the MAX checkpoint format.
**Return type:**
[Buffer](#max.driver.Buffer)
## `scan_available_devices()` {#max.driver.scan_available_devices}
> max.driver.scan\_available\_devices()
Returns all available accelerators, or the CPU if none are found.
## `accelerator_count()` {#max.driver.accelerator_count}
> max.driver.accelerator\_count() → [int](https://docs.python.org/3/library/functions.html#int)
Returns number of accelerator devices available.
---
## dtype
Provides data type definitions for tensors in MAX Engine. These data types are
essential for defining the precision and memory layout of tensor data when
working with machine learning models.
This module defines the [`DType`](#max.dtype.DType) enum, which represents all supported tensor
data types in MAX Engine, including:
* Integer types (signed and unsigned): `int8` | `uint8` | `int16` | `uint16` | `int32` | `uint32` | `int64` | `uint64`
* Floating-point types: `float16` | `bfloat16` | `float32` | `float64`, plus reduced-precision `float8` and `float4` variants
* Boolean type: `bool`
The module also provides utilities for converting between MAX Engine data types
and [NumPy dtypes](https://numpy.org/doc/stable/user/basics.types.html), making
it easy to interoperate with the NumPy ecosystem.
```python
import numpy as np
from max.dtype import DType
# Create a NumPy array using a MAX dtype
tensor = np.zeros((2, 3), dtype=DType.float32.to_numpy())
# Convert NumPy dtype to MAX DType
array = np.ones((4, 4), dtype=np.float16)
max_dtype = DType.from_numpy(array.dtype)
# Check properties of data types
is_float = DType.float32.is_float() # True
is_int = DType.int64.is_integral() # True
size = DType.float64.size_in_bytes # 8
```
## `DType` {#max.dtype.DType}
> class max.dtype.DType(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
The tensor data type.
### `align` {#max.dtype.DType.align}
> property align
Returns the alignment requirement of the data type in bytes.
The alignment specifies the memory boundary that values of this data type
must be aligned to for optimal performance and correctness.
### `bfloat16` {#max.dtype.DType.bfloat16}
> bfloat16 = 80
### `bool` {#max.dtype.DType.bool}
> bool = 1
### `float16` {#max.dtype.DType.float16}
> float16 = 79
### `float32` {#max.dtype.DType.float32}
> float32 = 81
### `float4_e2m1fn` {#max.dtype.DType.float4_e2m1fn}
> float4\_e2m1fn = 64
### `float64` {#max.dtype.DType.float64}
> float64 = 82
### `float8_e4m3fn` {#max.dtype.DType.float8_e4m3fn}
> float8\_e4m3fn = 75
### `float8_e4m3fnuz` {#max.dtype.DType.float8_e4m3fnuz}
> float8\_e4m3fnuz = 76
### `float8_e5m2` {#max.dtype.DType.float8_e5m2}
> float8\_e5m2 = 77
### `float8_e5m2fnuz` {#max.dtype.DType.float8_e5m2fnuz}
> float8\_e5m2fnuz = 78
### `float8_e8m0fnu` {#max.dtype.DType.float8_e8m0fnu}
> float8\_e8m0fnu = 73
### `from_numpy()` {#max.dtype.DType.from_numpy}
> from\_numpy(dtype)
Converts a NumPy dtype to the corresponding DType.
**Parameters:**
dtype (np.dtype) – The NumPy dtype to convert.
**Returns:**
The corresponding DType enum value.
**Return type:**
[DType](#max.dtype.DType)
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the input dtype is not supported.
### `int16` {#max.dtype.DType.int16}
> int16 = 137
### `int32` {#max.dtype.DType.int32}
> int32 = 139
### `int64` {#max.dtype.DType.int64}
> int64 = 141
### `int8` {#max.dtype.DType.int8}
> int8 = 135
### `is_float()` {#max.dtype.DType.is_float}
> is\_float(self) → [bool](https://docs.python.org/3/library/functions.html#bool)
Checks if the data type is a floating-point type.
### `is_float8()` {#max.dtype.DType.is_float8}
> is\_float8(self) → [bool](https://docs.python.org/3/library/functions.html#bool)
Checks if the data type is an 8-bit floating-point type.
### `is_half()` {#max.dtype.DType.is_half}
> is\_half(self) → [bool](https://docs.python.org/3/library/functions.html#bool)
Checks if the data type is a half-precision floating-point type.
### `is_integral()` {#max.dtype.DType.is_integral}
> is\_integral(self) → [bool](https://docs.python.org/3/library/functions.html#bool)
Checks if the data type is an integer type.
### `is_signed_integral()` {#max.dtype.DType.is_signed_integral}
> is\_signed\_integral(self) → [bool](https://docs.python.org/3/library/functions.html#bool)
Checks if the data type is a signed integer type.
### `is_unsigned_integral()` {#max.dtype.DType.is_unsigned_integral}
> is\_unsigned\_integral(self) → [bool](https://docs.python.org/3/library/functions.html#bool)
Checks if the data type is an unsigned integer type.
### `size_in_bits` {#max.dtype.DType.size_in_bits}
> property size\_in\_bits
Returns the size of the data type in bits.
This indicates how many bits are required to store a single value
of this data type in memory.
### `size_in_bytes` {#max.dtype.DType.size_in_bytes}
> property size\_in\_bytes
Returns the size of the data type in bytes.
This indicates how many bytes are required to store a single value
of this data type in memory.
### `to_numpy()` {#max.dtype.DType.to_numpy}
> to\_numpy()
Converts this `DType` to the corresponding NumPy dtype.
**Returns:**
The corresponding NumPy dtype object.
**Return type:**
np.dtype
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the dtype is not supported.
### `uint16` {#max.dtype.DType.uint16}
> uint16 = 136
### `uint32` {#max.dtype.DType.uint32}
> uint32 = 138
### `uint64` {#max.dtype.DType.uint64}
> uint64 = 140
### `uint8` {#max.dtype.DType.uint8}
> uint8 = 134
## `finfo` {#max.dtype.finfo}
> class max.dtype.finfo(dtype)
Numerical properties of a floating point `max.dtype.DType`.
This is modeled after `torch.finfo`, providing `bits`, `eps`,
`max`, `min`, `tiny`, `smallest_normal`, and `dtype`
attributes for every MAX float dtype—including bfloat16, float8, and
float4 types that NumPy cannot represent.
**Parameters:**
dtype ([DType](#max.dtype.DType)) – A floating-point `DType` to query.
**Raises:**
[TypeError](https://docs.python.org/3/library/exceptions.html#TypeError) – If dtype is not a floating-point type.
### `bits` {#max.dtype.finfo.bits}
> bits: [int](https://docs.python.org/3/library/functions.html#int)
### `dtype` {#max.dtype.finfo.dtype}
> dtype: [DType](#max.dtype.DType)
### `eps` {#max.dtype.finfo.eps}
> eps: [float](https://docs.python.org/3/library/functions.html#float)
### `max` {#max.dtype.finfo.max}
> max: [float](https://docs.python.org/3/library/functions.html#float)
### `min` {#max.dtype.finfo.min}
> min: [float](https://docs.python.org/3/library/functions.html#float)
### `smallest_normal` {#max.dtype.finfo.smallest_normal}
> property smallest\_normal: [float](https://docs.python.org/3/library/functions.html#float)
Alias for `tiny` (`torch.finfo` compatibility).
### `tiny` {#max.dtype.finfo.tiny}
> tiny: [float](https://docs.python.org/3/library/functions.html#float)
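For intuition about what these attributes report, the `float16` values can be derived by hand from the IEEE 754 binary16 format (5 exponent bits, 10 explicit mantissa bits). This stdlib-only sketch is an illustration, not part of the API:

```python
# IEEE 754 binary16: 1 sign bit, 5 exponent bits, 10 mantissa bits
MANTISSA_BITS = 10
EXP_MAX = 15   # largest normal binary exponent
EXP_MIN = -14  # smallest normal binary exponent

bits = 16
eps = 2.0 ** -MANTISSA_BITS   # spacing between 1.0 and the next value
tiny = 2.0 ** EXP_MIN         # smallest positive normal number
max_val = (2 - 2.0 ** -MANTISSA_BITS) * 2.0 ** EXP_MAX

print(bits, eps, tiny, max_val)  # 16 0.0009765625 6.103515625e-05 65504.0
```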
---
## engine
The APIs in this module allow you to run inference with MAX Engine—a graph
compiler and runtime that accelerates your AI models on a wide variety of
hardware.
## `InferenceSession` {#max.engine.InferenceSession}
> class max.engine.InferenceSession(devices, num\_threads=None, \*, custom\_extensions=None)
Manages an inference session in which you can load and run models.
You need an instance of this to load a model as a [`Model`](#max.engine.Model) object.
For example:
```python
from pathlib import Path
from max import engine
from max.driver import CPU

session = engine.InferenceSession(devices=[CPU()])
model_path = Path('bert-base-uncased')
model = session.load(model_path)
```
### `devices` {#max.engine.InferenceSession.devices}
> property devices: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Device](driver.md#max.driver.Device)]
A list of available devices.
### `gpu_profiling()` {#max.engine.InferenceSession.gpu_profiling}
> gpu\_profiling(mode)
Enables GPU profiling instrumentation for the session.
This enables GPU profiling instrumentation that works with NVIDIA
Nsight Systems and Nsight Compute. When enabled, the runtime adds CUDA
driver calls and NVTX markers that allow profiling tools to correlate
GPU kernel executions with host-side code.
For example, to enable detailed profiling for Nsight Systems analysis,
call `gpu_profiling()` before `load()`:
```python
from max.engine import InferenceSession, GPUProfilingMode
from max.driver import Accelerator
session = InferenceSession(devices=[Accelerator()])
session.gpu_profiling(GPUProfilingMode.DETAILED)
model = session.load(my_graph)
```
Then run it with `nsys`:
```bash
nsys profile --trace=cuda,nvtx python example.py
```
Or, instead of calling `session.gpu_profiling()` in the code, you can
set the `MODULAR_ENABLE_PROFILING` environment variable when you call
`nsys profile`:
```bash
MODULAR_ENABLE_PROFILING=detailed nsys profile --trace=cuda,nvtx python script.py
```
Beware that `gpu_profiling()` overrides the
`MODULAR_ENABLE_PROFILING` environment variable if also used.
:::note Note
Profiling instrumentation adds runtime overhead and should be
disabled for production deployments.
:::
**Parameters:**
mode ([GPUProfilingMode](#max.engine.GPUProfilingMode)) –
The profiling mode to set. One of:
* [`GPUProfilingMode.OFF`](#max.engine.GPUProfilingMode.OFF): Disable profiling (default).
* [`GPUProfilingMode.ON`](#max.engine.GPUProfilingMode.ON): Enable basic profiling with
NVTX markers for kernel correlation.
* [`GPUProfilingMode.DETAILED`](#max.engine.GPUProfilingMode.DETAILED): Enable detailed profiling
with additional Python-level NVTX markers.
**Return type:**
None
:::note See also
* [GPU profiling with Nsight Systems](/max/gpu-system-profiling)
:::
### `load()` {#max.engine.InferenceSession.load}
> load(model, \*, custom\_extensions=None, weights\_registry=None)
Loads a trained model and compiles it for inference.
**Parameters:**
* model ([str](https://docs.python.org/3/library/stdtypes.html#str) | Path | [Graph](graph/Graph.md#max.graph.Graph)) – Path to a model.
* custom\_extensions (CustomExtensionsType | None) – The extensions to load for the model.
Supports paths to .mojopkg custom ops.
* weights\_registry (Mapping\[[str](https://docs.python.org/3/library/stdtypes.html#str), [DLPackArray](driver.md#max.driver.DLPackArray)] | None) – A mapping from model weight names to
their values. The values are currently expected to be DLPack
arrays. If an array is a read-only NumPy array, the user must
ensure that its lifetime extends beyond the lifetime of the model.
**Returns:**
The loaded model, compiled and ready to execute.
**Raises:**
[RuntimeError](https://docs.python.org/3/library/exceptions.html#RuntimeError) – If the path provided is invalid.
**Return type:**
[Model](#max.engine.Model)
### `set_mojo_assert_level()` {#max.engine.InferenceSession.set_mojo_assert_level}
> set\_mojo\_assert\_level(level)
Sets which Mojo asserts are kept in the compiled model.
**Parameters:**
level (AssertLevel)
**Return type:**
None
### `set_mojo_log_level()` {#max.engine.InferenceSession.set_mojo_log_level}
> set\_mojo\_log\_level(level)
Sets the verbosity of Mojo logging in the compiled model.
### `set_split_k_reduction_precision()` {#max.engine.InferenceSession.set_split_k_reduction_precision}
> set\_split\_k\_reduction\_precision(precision)
Sets the accumulation precision for split-k reductions in large matmuls.
### `use_old_top_k_kernel()` {#max.engine.InferenceSession.use_old_top_k_kernel}
> use\_old\_top\_k\_kernel(mode)
Enables the old top-k kernel.
By default, the new top-k kernel is used, to keep behavior consistent with
`max/kernels/src/nn/topk.mojo`.
**Parameters:**
mode ([str](https://docs.python.org/3/library/stdtypes.html#str)) – String to enable or disable the old kernel. Accepts “false”, “off”,
“no”, or “0” to disable; any other value enables.
**Return type:**
None
## `Model` {#max.engine.Model}
> class max.engine.Model
A loaded model that you can execute.
Do not instantiate this class directly. Instead, create it with
[`InferenceSession`](#max.engine.InferenceSession).
### `__call__()` {#max.engine.Model.__call}
> \_\_call\_\_(\*args, \*\*kwargs)
Call self as a function.
### `capture()` {#max.engine.Model.capture}
> capture(\*inputs)
Capture execution into a device graph keyed by input shapes/dtypes.
Capture is best-effort and model-dependent. If the model issues
capture-unsafe operations (for example, host-device synchronization),
graph capture may fail. Callers should choose capture-safe execution paths.
### `input_metadata` {#max.engine.Model.input_metadata}
> property input\_metadata
Metadata about the model’s input tensors, as a list of
[`TensorSpec`](#max.engine.TensorSpec) objects.
For example, you can print the input tensor names, shapes, and dtypes:
```python
for tensor in model.input_metadata:
print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}')
```
### `output_metadata` {#max.engine.Model.output_metadata}
> property output\_metadata
Metadata about the model’s output tensors, as a list of
[`TensorSpec`](#max.engine.TensorSpec) objects.
For example, you can print the output tensor names, shapes, and dtypes:
```python
for tensor in model.output_metadata:
print(f'name: {tensor.name}, shape: {tensor.shape}, dtype: {tensor.dtype}')
```
### `replay()` {#max.engine.Model.replay}
> replay(\*inputs)
Replay the captured device graph for these inputs.
## `GPUProfilingMode` {#max.engine.GPUProfilingMode}
> class max.engine.GPUProfilingMode(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
The supported modes for GPU profiling.
GPU profiling modes control the level of instrumentation when profiling
MAX applications with NVIDIA Nsight Systems or Nsight Compute. Higher
levels provide more detail but may introduce additional overhead.
:::note See also
[`InferenceSession.gpu_profiling()`](#max.engine.InferenceSession.gpu_profiling): Method to set the profiling mode.
:::
### `DETAILED` {#max.engine.GPUProfilingMode.DETAILED}
> DETAILED = 'detailed'
Enable detailed GPU profiling with additional NVTX markers
from Python code. This mode provides the most visibility into
which Python operations correspond to which GPU kernels, but
has the highest overhead.
### `OFF` {#max.engine.GPUProfilingMode.OFF}
> OFF = 'off'
Disable GPU profiling instrumentation. This is the default mode
and incurs no profiling overhead.
### `ON` {#max.engine.GPUProfilingMode.ON}
> ON = 'on'
Enable basic GPU profiling. Adds CUDA driver calls and NVTX
markers for correlating kernel executions with host-side code.
## `LogLevel` {#max.engine.LogLevel}
> class max.engine.LogLevel(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
The LogLevel specifies the log level used by the Mojo Ops.
### `CRITICAL` {#max.engine.LogLevel.CRITICAL}
> CRITICAL = 'critical'
### `DEBUG` {#max.engine.LogLevel.DEBUG}
> DEBUG = 'debug'
### `ERROR` {#max.engine.LogLevel.ERROR}
> ERROR = 'error'
### `INFO` {#max.engine.LogLevel.INFO}
> INFO = 'info'
### `NOTSET` {#max.engine.LogLevel.NOTSET}
> NOTSET = 'notset'
### `TRACE` {#max.engine.LogLevel.TRACE}
> TRACE = 'trace'
### `WARNING` {#max.engine.LogLevel.WARNING}
> WARNING = 'warning'
## `TensorSpec` {#max.engine.TensorSpec}
> class max.engine.TensorSpec
Defines the properties of a tensor, including its name, shape and
data type.
For usage examples, see [`Model.input_metadata`](#max.engine.Model.input_metadata).
### `dtype` {#max.engine.TensorSpec.dtype}
> property dtype
A tensor data type.
### `name` {#max.engine.TensorSpec.name}
> property name
A tensor name.
### `shape` {#max.engine.TensorSpec.shape}
> property shape
The shape of the tensor as a list of integers.
If a dimension size is unknown/dynamic (such as the batch size), its
value is `None`.
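Since `None` marks a dynamic dimension, validating a concrete input shape against a spec’s shape is a simple wildcard match. A hypothetical helper (`shape_matches` is not part of the API) sketches the idea:

```python
def shape_matches(spec_shape, concrete_shape):
    """True if concrete_shape is compatible with spec_shape,
    where None in spec_shape marks a dynamic dimension."""
    if len(spec_shape) != len(concrete_shape):
        return False
    return all(dim is None or dim == actual
               for dim, actual in zip(spec_shape, concrete_shape))

# A spec shape of [None, 128] accepts any batch size:
print(shape_matches([None, 128], [32, 128]))  # True
print(shape_matches([None, 128], [32, 256]))  # False
```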
## `CustomExtensionsType` {#max.engine.CustomExtensionsType}
> max.engine.CustomExtensionsType = collections.abc.Sequence\[str | pathlib.\_local.Path] | str | pathlib.\_local.Path
Represents a PEP 604 union type (e.g. `int | str`).
---
## entrypoints
## `LLM` {#max.entrypoints.llm.LLM}
> class max.entrypoints.llm.LLM(pipeline\_config)
A high-level interface for interacting with LLMs.
### `generate()` {#max.entrypoints.llm.LLM.generate}
> generate(prompts, max\_new\_tokens=100, use\_tqdm=True)
Generates text completions for the given prompts.
This method is thread-safe and may be used on the same LLM instance
from multiple threads concurrently with no external synchronization.
**Parameters:**
* prompts ([str](https://docs.python.org/3/library/stdtypes.html#str) | [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]) – The input string or list of strings to generate completions for.
* max\_new\_tokens ([int](https://docs.python.org/3/library/functions.html#int) | None) – The maximum number of tokens to generate in the response.
* use\_tqdm ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to display a progress bar during generation.
**Returns:**
A list of generated text completions corresponding to each input prompt.
**Raises:**
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If prompts is empty or contains invalid data.
* [RuntimeError](https://docs.python.org/3/library/exceptions.html#RuntimeError) – If the model fails to generate completions.
---
## functional
Provides functional APIs for tensor operations.
This module provides functional-style tensor operations that work seamlessly
with both MAX Graph construction and eager Tensor execution. All operations
are wrapped versions of the core graph operations that automatically handle
different execution contexts.
These operations can be used in both graph construction and eager execution.
## `CustomExtensionType` {#max.functional.CustomExtensionType}
> max.functional.CustomExtensionType: [TypeAlias](https://docs.python.org/3/library/typing.html#typing.TypeAlias) = str | pathlib.\_local.Path
Type alias for custom extension paths, matching `engine.CustomExtensionsType`.
## `abs()` {#max.functional.abs}
> max.functional.abs(x)
Computes the absolute value element-wise.
See [`max.graph.ops.abs()`](graph/ops.md#max.graph.ops.abs) for details.
## `add()` {#max.functional.add}
> max.functional.add(lhs, rhs)
Adds two tensors element-wise.
See [`max.graph.ops.add()`](graph/ops.md#max.graph.ops.add) for details.
## `allreduce_sum()` {#max.functional.allreduce_sum}
> max.functional.allreduce\_sum(inputs, signal\_buffers)
Sums values from multiple devices.
See `max.graph.ops.allreduce.sum()` for details.
## `argmax()` {#max.functional.argmax}
> max.functional.argmax(x, axis=-1)
Returns the indices of the maximum values along an axis.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray)) – The input tensor.
* axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to find the maximum indices. If None,
finds the index of the maximum across all elements (flattened).
**Returns:**
A tensor containing the indices of the maximum values.
## `argmin()` {#max.functional.argmin}
> max.functional.argmin(x, axis=-1)
Returns the indices of the minimum values along an axis.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray)) – The input tensor.
* axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to find the minimum indices. If None,
finds the index of the minimum across all elements (flattened).
**Returns:**
A tensor containing the indices of the minimum values.
## `argsort()` {#max.functional.argsort}
> max.functional.argsort(x, ascending=True)
Returns the indices that would sort a tensor along an axis.
See [`max.graph.ops.argsort()`](graph/ops.md#max.graph.ops.argsort) for details.
## `as_interleaved_complex()` {#max.functional.as_interleaved_complex}
> max.functional.as\_interleaved\_complex(x)
Converts a tensor to interleaved complex representation.
See [`max.graph.ops.as_interleaved_complex()`](graph/ops.md#max.graph.ops.as_interleaved_complex) for details.
## `atanh()` {#max.functional.atanh}
> max.functional.atanh(x)
Computes the inverse hyperbolic tangent element-wise.
See [`max.graph.ops.atanh()`](graph/ops.md#max.graph.ops.atanh) for details.
## `band_part()` {#max.functional.band_part}
> max.functional.band\_part(x, num\_lower=None, num\_upper=None, exclude=False)
Copies a tensor setting everything outside a central band to zero.
See [`max.graph.ops.band_part()`](graph/ops.md#max.graph.ops.band_part) for details.
## `broadcast_to()` {#max.functional.broadcast_to}
> max.functional.broadcast\_to(x, shape, out\_dims=None)
Broadcasts a tensor to a new shape.
See [`max.graph.ops.broadcast_to()`](graph/ops.md#max.graph.ops.broadcast_to) for details.
## `buffer_store()` {#max.functional.buffer_store}
> max.functional.buffer\_store(destination, source)
Sets a tensor buffer to new values.
See [`max.graph.ops.buffer_store()`](graph/ops.md#max.graph.ops.buffer_store) for details.
## `buffer_store_slice()` {#max.functional.buffer_store_slice}
> max.functional.buffer\_store\_slice(destination, source, indices)
Sets a slice of a tensor buffer to new values.
See [`max.graph.ops.buffer_store_slice()`](graph/ops.md#max.graph.ops.buffer_store_slice) for details.
## `cast()` {#max.functional.cast}
> max.functional.cast(x, dtype)
Casts a tensor to a different data type.
See [`max.graph.ops.cast()`](graph/ops.md#max.graph.ops.cast) for details.
## `chunk()` {#max.functional.chunk}
> max.functional.chunk(x, chunks, axis=0)
Splits a tensor into chunks along a dimension.
See [`max.graph.ops.chunk()`](graph/ops.md#max.graph.ops.chunk) for details.
## `complex_mul()` {#max.functional.complex_mul}
> max.functional.complex\_mul(lhs, rhs)
Multiplies two complex-valued tensors.
See `max.graph.ops.complex.mul()` for details.
## `concat()` {#max.functional.concat}
> max.functional.concat(original\_vals, axis=0)
Concatenates a list of tensors along an axis.
See [`max.graph.ops.concat()`](graph/ops.md#max.graph.ops.concat) for details.
## `constant()` {#max.functional.constant}
> max.functional.constant(value, dtype=None, device=None)
Creates a constant tensor.
See [`max.graph.ops.constant()`](graph/ops.md#max.graph.ops.constant) for details.
## `constant_external()` {#max.functional.constant_external}
> max.functional.constant\_external(name, type)
Creates a constant tensor from external data.
See [`max.graph.ops.constant_external()`](graph/ops.md#max.graph.ops.constant_external) for details.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str))
* type ([TensorType](graph/type.md#max.graph.type.TensorType))
## `cos()` {#max.functional.cos}
> max.functional.cos(x)
Computes the cosine element-wise.
See [`max.graph.ops.cos()`](graph/ops.md#max.graph.ops.cos) for details.
## `cumsum()` {#max.functional.cumsum}
> max.functional.cumsum(x, axis=-1, exclusive=False, reverse=False)
Computes the cumulative sum along an axis.
See [`max.graph.ops.cumsum()`](graph/ops.md#max.graph.ops.cumsum) for details.
## `custom()` {#max.functional.custom}
> max.functional.custom(name, device, values, out\_types, parameters=None, custom\_extensions=None)
Applies a custom operation with optional custom extension loading.
Creates a node to execute a custom graph operation. The custom op should be
registered by annotating a Mojo function with the `@compiler.register`
decorator.
This function extends [`max.graph.ops.custom()`](graph/ops.md#max.graph.ops.custom) with automatic loading
of custom extension libraries, eliminating the need to manually import
kernels before use.
**Example:**
```python
from max import functional as F, Tensor
from max.dtype import DType
from max.driver import CPU
x = Tensor.full([10], 10, dtype=DType.float32, device=CPU())
y = Tensor.ones([10], dtype=DType.float32, device=CPU())
result = F.custom(
"vector_sum",
device=x.device,
values=[x, y],
out_types=[x.type],
custom_extensions="ops.mojopkg"
)[0]
```
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The op name provided to `@compiler.register`.
* device ([driver.Device](driver.md#max.driver.Device) | [DeviceRef](graph/ops.md#max.graph.ops.DeviceRef)) – Device that the op is assigned to. This becomes a `target`
parameter to the kernel.
* values (Sequence\[[Value](graph/Value.md#max.graph.Value)\[Any]]) – The op function’s arguments.
* out\_types (Sequence\[[Type](graph/type.md#max.graph.type.Type)\[Any]]) – The list of op function’s return types.
* parameters (Mapping\[[str](https://docs.python.org/3/library/stdtypes.html#str), [bool](https://docs.python.org/3/library/functions.html#bool) | [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [DType](dtype.md#max.dtype.DType)] | None) – Dictionary of extra parameters expected by the kernel.
* custom\_extensions (CustomExtensionsType | None) – Paths to custom extension libraries (`.mojopkg`
files or Mojo source directories). Extensions are automatically
loaded into the current graph if not already present.
**Returns:**
Symbolic values representing the outputs of the op in the graph.
These correspond 1:1 with the types passed as `out_types`.
:::note See also
[`max.graph.ops.custom()`](graph/ops.md#max.graph.ops.custom): The underlying graph operation.
[`inplace_custom()`](#max.functional.inplace_custom): For in-place custom operations.
:::
## `div()` {#max.functional.div}
> max.functional.div(lhs, rhs)
Divides two tensors element-wise.
See [`max.graph.ops.div()`](graph/ops.md#max.graph.ops.div) for details.
## `erf()` {#max.functional.erf}
> max.functional.erf(x)
Computes the error function element-wise.
See [`max.graph.ops.erf()`](graph/ops.md#max.graph.ops.erf) for details.
## `exp()` {#max.functional.exp}
> max.functional.exp(x)
Computes the exponential element-wise.
See [`max.graph.ops.exp()`](graph/ops.md#max.graph.ops.exp) for details.
## `flatten()` {#max.functional.flatten}
> max.functional.flatten(x, start\_dim=0, end\_dim=-1)
Flattens a tensor.
See [`max.graph.ops.flatten()`](graph/ops.md#max.graph.ops.flatten) for details.
## `floor()` {#max.functional.floor}
> max.functional.floor(x)
Computes the floor element-wise.
See [`max.graph.ops.floor()`](graph/ops.md#max.graph.ops.floor) for details.
## `functional()` {#max.functional.functional}
> max.functional.functional(op)
Decorator that converts a graph operation to support multiple tensor
types.
**Parameters:**
op ([Callable](graph/ops.md#max.graph.ops.Callable)\[\[...], [Any](https://docs.python.org/3/library/typing.html#typing.Any)])
## `gather()` {#max.functional.gather}
> max.functional.gather(input, indices, axis)
Gathers values along an axis specified by indices.
See [`max.graph.ops.gather()`](graph/ops.md#max.graph.ops.gather) for details.
## `gelu()` {#max.functional.gelu}
> max.functional.gelu(x, approximate='none')
Applies the Gaussian Error Linear Unit (GELU) activation.
See [`max.graph.ops.gelu()`](graph/ops.md#max.graph.ops.gelu) for details.
**Parameters:**
* x ([TensorValue](graph/TensorValue.md#max.graph.TensorValue))
* approximate ([str](https://docs.python.org/3/library/stdtypes.html#str))
## `greater()` {#max.functional.greater}
> max.functional.greater(lhs, rhs)
Computes element-wise greater-than comparison.
See [`max.graph.ops.greater()`](graph/ops.md#max.graph.ops.greater) for details.
## `hann_window()` {#max.functional.hann_window}
> max.functional.hann\_window(window\_length, device, periodic=True, dtype=float32)
Creates a Hann window.
See [`max.graph.ops.hann_window()`](graph/ops.md#max.graph.ops.hann_window) for details.
## `inplace_custom()` {#max.functional.inplace_custom}
> max.functional.inplace\_custom(name, device, values, out\_types=None, parameters=None, custom\_extensions=None)
Applies an in-place custom operation with optional custom extension loading.
Creates a node to execute an in-place custom graph operation. The custom op
should be registered by annotating a Mojo function with the
`@compiler.register` decorator.
This function extends [`max.graph.ops.inplace_custom()`](graph/ops.md#max.graph.ops.inplace_custom) with automatic
loading of custom extension libraries, eliminating the need to manually
import kernels before use.
**Example:**
```python
from max import functional as F, Tensor
from max.dtype import DType
from max.driver import CPU
# Create a buffer for in-place modification
data = Tensor.zeros([10], dtype=DType.float32, device=CPU())
# Use in-place custom op with inline extension loading
F.inplace_custom(
"my_inplace_op",
device=data.device,
values=[data],
custom_extensions="ops.mojopkg"
)
```
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The op name provided to `@compiler.register`.
* device ([driver.Device](driver.md#max.driver.Device) | [DeviceRef](graph/ops.md#max.graph.ops.DeviceRef)) – Device that the op is assigned to. This becomes a `target`
parameter to the kernel.
* values (Sequence\[[Value](graph/Value.md#max.graph.Value)\[Any]]) – The op function’s arguments. At least one must be a
`BufferValue` or `_OpaqueValue`.
* out\_types (Sequence\[[Type](graph/type.md#max.graph.type.Type)\[Any]] | None) – The list of op function’s return types. Can be None if the
operation has no outputs.
* parameters ([dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [bool](https://docs.python.org/3/library/functions.html#bool) | [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [DType](dtype.md#max.dtype.DType)] | None) – Dictionary of extra parameters expected by the kernel.
* custom\_extensions (CustomExtensionsType | None) – Paths to custom extension libraries (`.mojopkg`
files or Mojo source directories). Extensions are automatically
loaded into the current graph if not already present.
**Returns:**
Symbolic values representing the outputs of the op in the graph.
:::note See also
[`max.graph.ops.inplace_custom()`](graph/ops.md#max.graph.ops.inplace_custom): The underlying graph operation.
[`custom()`](#max.functional.custom): For non-in-place custom operations.
:::
## `irfft()` {#max.functional.irfft}
> max.functional.irfft(input\_tensor, n=None, axis=-1, normalization=Normalization.BACKWARD, input\_is\_complex=False, buffer\_size\_mb=512)
Computes the inverse real FFT.
See [`max.graph.ops.irfft()`](graph/ops.md#max.graph.ops.irfft) for details.
## `is_inf()` {#max.functional.is_inf}
> max.functional.is\_inf(x)
Checks for infinite values element-wise.
See [`max.graph.ops.is_inf()`](graph/ops.md#max.graph.ops.is_inf) for details.
## `is_nan()` {#max.functional.is_nan}
> max.functional.is\_nan(x)
Checks for NaN values element-wise.
See [`max.graph.ops.is_nan()`](graph/ops.md#max.graph.ops.is_nan) for details.
## `lazy()` {#max.functional.lazy}
> max.functional.lazy()
Context manager for lazy tensor evaluation.
Within this context, tensor operations are recorded but not executed.
Tensors remain unrealized until explicitly awaited via `await tensor.realize`
or until their values are needed (e.g., by calling `.item()`).
This is particularly useful for creating tensors that may never be used:
lazy tensors that go unused never allocate memory or perform operations.
```python
from max import functional as F
from max.tensor import Tensor
from max.nn import Linear
with F.lazy():
model = Linear(2, 3)
print(model) # Lazy weights not initialized
# Executing the model would be fine! The weights would be created
# on first use.
# output = model(Tensor.ones([5, 2]))
# Load pretrained weights, never creating the original random weights
weights = {
"weight": Tensor.zeros([3, 2]),
"bias": Tensor.zeros([3]),
}
model.load_state_dict(weights)
```
## `log()` {#max.functional.log}
> max.functional.log(x)
Computes the natural logarithm element-wise.
See [`max.graph.ops.log()`](graph/ops.md#max.graph.ops.log) for details.
## `logsoftmax()` {#max.functional.logsoftmax}
> max.functional.logsoftmax(value, axis=-1)
Applies the log softmax function.
See [`max.graph.ops.logsoftmax()`](graph/ops.md#max.graph.ops.logsoftmax) for details.
## `masked_scatter()` {#max.functional.masked_scatter}
> max.functional.masked\_scatter(input, mask, updates, out\_dim)
Scatters values according to a mask.
See [`max.graph.ops.masked_scatter()`](graph/ops.md#max.graph.ops.masked_scatter) for details.
## `max()` {#max.functional.max}
> max.functional.max(x, y=None, /, axis=-1)
Returns the maximum values along an axis, or elementwise maximum of two tensors.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray)) – The input tensor.
* y (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray) | None) – Optional second tensor for elementwise maximum.
* axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to compute the maximum (only for reduction).
If None, computes the maximum across all elements (flattened).
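For intuition, the two modes mirror NumPy's `max`/`maximum` (this is a sketch of the semantics, not MAX code):

```python
import numpy as np

x = np.array([[1.0, 5.0, 3.0],
              [4.0, 2.0, 6.0]], dtype=np.float32)

# Reduction mode: maximum along an axis
np.max(x, axis=-1)          # array([5., 6.], dtype=float32)

# Elementwise mode: maximum of two tensors
y = np.array([[2.0, 0.0, 9.0],
              [1.0, 8.0, 5.0]], dtype=np.float32)
np.maximum(x, y)            # [[2., 5., 9.], [4., 8., 6.]]

# axis=None: maximum over all (flattened) elements
np.max(x)                   # 6.0
```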
## `mean()` {#max.functional.mean}
> max.functional.mean(x, axis=-1)
Computes the mean along specified axes.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray)) – The input tensor.
* axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to compute the mean. If None,
computes the mean across all elements (flattened).
## `min()` {#max.functional.min}
> max.functional.min(x, y=None, /, axis=-1)
Returns the minimum values along an axis, or elementwise minimum of two tensors.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray)) – The input tensor.
* y (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray) | None) – Optional second tensor for elementwise minimum.
* axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to compute the minimum (only for reduction).
If None, computes the minimum across all elements (flattened).
## `mod()` {#max.functional.mod}
> max.functional.mod(lhs, rhs)
Computes the modulo operation element-wise.
See [`max.graph.ops.mod()`](graph/ops.md#max.graph.ops.mod) for details.
## `mul()` {#max.functional.mul}
> max.functional.mul(lhs, rhs)
Multiplies two tensors element-wise.
See [`max.graph.ops.mul()`](graph/ops.md#max.graph.ops.mul) for details.
## `negate()` {#max.functional.negate}
> max.functional.negate(x)
Negates a tensor element-wise.
See [`max.graph.ops.negate()`](graph/ops.md#max.graph.ops.negate) for details.
## `nonzero()` {#max.functional.nonzero}
> max.functional.nonzero(x, out\_dim)
Returns the indices of non-zero elements.
See [`max.graph.ops.nonzero()`](graph/ops.md#max.graph.ops.nonzero) for details.
## `outer()` {#max.functional.outer}
> max.functional.outer(lhs, rhs)
Computes the outer product of two vectors.
See [`max.graph.ops.outer()`](graph/ops.md#max.graph.ops.outer) for details.
## `pad()` {#max.functional.pad}
> max.functional.pad(input, paddings, mode='constant', value=0)
Pads a tensor.
See [`max.graph.ops.pad()`](graph/ops.md#max.graph.ops.pad) for details.
## `permute()` {#max.functional.permute}
> max.functional.permute(x, dims)
Permutes the dimensions of a tensor.
See [`max.graph.ops.permute()`](graph/ops.md#max.graph.ops.permute) for details.
## `pow()` {#max.functional.pow}
> max.functional.pow(lhs, rhs)
Raises tensor elements to a power.
See [`max.graph.ops.pow()`](graph/ops.md#max.graph.ops.pow) for details.
## `relu()` {#max.functional.relu}
> max.functional.relu(x)
Applies the ReLU activation function.
See [`max.graph.ops.relu()`](graph/ops.md#max.graph.ops.relu) for details.
## `repeat_interleave()` {#max.functional.repeat_interleave}
> max.functional.repeat\_interleave(x, repeats, axis=None, out\_dim=None)
Repeats elements of a tensor.
See [`max.graph.ops.repeat_interleave()`](graph/ops.md#max.graph.ops.repeat_interleave) for details.
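The semantics are analogous to NumPy's `repeat` (an illustrative sketch, not the MAX kernel):

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])

# Repeat each element twice along axis 0
np.repeat(x, 2, axis=0)   # [[1, 2], [1, 2], [3, 4], [3, 4]]

# axis=None flattens first, then repeats
np.repeat(x, 2)           # [1, 1, 2, 2, 3, 3, 4, 4]
```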
## `reshape()` {#max.functional.reshape}
> max.functional.reshape(x, shape)
Reshapes a tensor to a new shape.
See [`max.graph.ops.reshape()`](graph/ops.md#max.graph.ops.reshape) for details.
## `rsqrt()` {#max.functional.rsqrt}
> max.functional.rsqrt(x)
Computes the reciprocal square root element-wise.
See [`max.graph.ops.rsqrt()`](graph/ops.md#max.graph.ops.rsqrt) for details.
## `scatter()` {#max.functional.scatter}
> max.functional.scatter(input, updates, indices, axis=-1)
Scatters values along an axis.
See [`max.graph.ops.scatter()`](graph/ops.md#max.graph.ops.scatter) for details.
## `sigmoid()` {#max.functional.sigmoid}
> max.functional.sigmoid(x)
Applies the sigmoid activation function.
See [`max.graph.ops.sigmoid()`](graph/ops.md#max.graph.ops.sigmoid) for details.
**Parameters:**
x ([TensorValue](graph/TensorValue.md#max.graph.TensorValue))
## `silu()` {#max.functional.silu}
> max.functional.silu(x)
Applies the SiLU (Swish) activation function.
See [`max.graph.ops.silu()`](graph/ops.md#max.graph.ops.silu) for details.
**Parameters:**
x ([TensorValue](graph/TensorValue.md#max.graph.TensorValue))
## `sin()` {#max.functional.sin}
> max.functional.sin(x)
Computes the sine element-wise.
See [`max.graph.ops.sin()`](graph/ops.md#max.graph.ops.sin) for details.
## `slice_tensor()` {#max.functional.slice_tensor}
> max.functional.slice\_tensor(x, indices)
Slices a tensor along specified dimensions.
See [`max.graph.ops.slice_tensor()`](graph/ops.md#max.graph.ops.slice_tensor) for details.
**Parameters:**
* x ([TensorValue](graph/TensorValue.md#max.graph.TensorValue))
* indices (SliceIndices)
## `softmax()` {#max.functional.softmax}
> max.functional.softmax(value, axis=-1)
Applies the softmax function.
See [`max.graph.ops.softmax()`](graph/ops.md#max.graph.ops.softmax) for details.
## `split()` {#max.functional.split}
> max.functional.split(x, split\_size\_or\_sections, axis=0)
Splits a tensor into multiple tensors along a given dimension.
This function supports two modes, matching PyTorch’s behavior:
* If `split_size_or_sections` is an **int**, splits into chunks of that
size (the last chunk may be smaller if the dimension is not evenly
divisible).
* If `split_size_or_sections` is a **list of ints**, splits into chunks
with exactly those sizes (must sum to the dimension size).
```python
from max import functional as F, Tensor
x = Tensor.ones([10, 4])
# Split into chunks of size 3 (last chunk is size 1)
chunks = F.split(x, 3, axis=0) # shapes: [3,4], [3,4], [3,4], [1,4]
# Split into exact sizes
chunks = F.split(x, [2, 3, 5], axis=0) # shapes: [2,4], [3,4], [5,4]
```
**Parameters:**
* x ([Tensor](tensor.md#max.tensor.Tensor) | [TensorValue](graph/TensorValue.md#max.graph.TensorValue)) – The input tensor to split.
* split\_size\_or\_sections ([int](https://docs.python.org/3/library/functions.html#int) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]) – Either an int (chunk size) or a list of ints
(exact sizes for each output tensor).
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The dimension along which to split. Defaults to 0.
## `sqrt()` {#max.functional.sqrt}
> max.functional.sqrt(x)
Computes the square root element-wise.
See [`max.graph.ops.sqrt()`](graph/ops.md#max.graph.ops.sqrt) for details.
## `squeeze()` {#max.functional.squeeze}
> max.functional.squeeze(x, axis)
Removes dimensions of size 1.
See [`max.graph.ops.squeeze()`](graph/ops.md#max.graph.ops.squeeze) for details.
## `stack()` {#max.functional.stack}
> max.functional.stack(values, axis=0)
Stacks tensors along a new dimension.
See [`max.graph.ops.stack()`](graph/ops.md#max.graph.ops.stack) for details.
## `sub()` {#max.functional.sub}
> max.functional.sub(lhs, rhs)
Subtracts two tensors element-wise.
See [`max.graph.ops.sub()`](graph/ops.md#max.graph.ops.sub) for details.
## `sum()` {#max.functional.sum}
> max.functional.sum(x, axis=-1)
Computes the sum along specified axes.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray)) – The input tensor.
* axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to compute the sum. If None,
computes the sum across all elements (flattened).
## `tanh()` {#max.functional.tanh}
> max.functional.tanh(x)
Computes the hyperbolic tangent element-wise.
See [`max.graph.ops.tanh()`](graph/ops.md#max.graph.ops.tanh) for details.
## `tile()` {#max.functional.tile}
> max.functional.tile(x, repeats)
Tiles a tensor by repeating it.
See [`max.graph.ops.tile()`](graph/ops.md#max.graph.ops.tile) for details.
## `top_k()` {#max.functional.top_k}
> max.functional.top\_k(input, k, axis=-1)
Returns the k largest elements along an axis.
See [`max.graph.ops.top_k()`](graph/ops.md#max.graph.ops.top_k) for details.
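The values-and-indices behavior of a top-k reduction can be sketched in NumPy (illustrative only; the `top_k` helper below is hypothetical, not part of MAX):

```python
import numpy as np

def top_k(x, k, axis=-1):
    # Sort descending along axis, keep the first k values and their indices
    idx = np.argsort(-x, axis=axis, kind="stable")
    top_idx = np.take(idx, range(k), axis=axis)
    top_val = np.take_along_axis(x, top_idx, axis=axis)
    return top_val, top_idx

x = np.array([[1.0, 9.0, 3.0, 7.0]])
values, indices = top_k(x, k=2)
# values  -> [[9., 7.]]
# indices -> [[1, 3]]
```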
## `transfer_to()` {#max.functional.transfer_to}
> max.functional.transfer\_to(x, device)
Transfers a tensor to a specified device.
See [`max.graph.ops.transfer_to()`](graph/ops.md#max.graph.ops.transfer_to) for details.
## `transpose()` {#max.functional.transpose}
> max.functional.transpose(x, axis\_1, axis\_2)
Transposes a tensor.
See [`max.graph.ops.transpose()`](graph/ops.md#max.graph.ops.transpose) for details.
## `unsqueeze()` {#max.functional.unsqueeze}
> max.functional.unsqueeze(x, axis)
Adds dimensions of size 1.
See [`max.graph.ops.unsqueeze()`](graph/ops.md#max.graph.ops.unsqueeze) for details.
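Together, `squeeze()` and `unsqueeze()` behave like NumPy's `squeeze`/`expand_dims` (a semantic sketch, not MAX code):

```python
import numpy as np

x = np.zeros((3, 4))

# unsqueeze: add a size-1 dimension at the given axis
np.expand_dims(x, axis=0).shape                       # (1, 3, 4)

# squeeze: remove a size-1 dimension, restoring the original shape
np.squeeze(np.expand_dims(x, axis=0), axis=0).shape   # (3, 4)
```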
## `where()` {#max.functional.where}
> max.functional.where(condition, x, y)
Selects elements from two tensors based on a condition.
See [`max.graph.ops.where()`](graph/ops.md#max.graph.ops.where) for details.
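The selection semantics match NumPy's `where` (shown here for intuition only):

```python
import numpy as np

condition = np.array([True, False, True])
x = np.array([1, 2, 3])
y = np.array([10, 20, 30])

# Take from x where condition is True, else from y
result = np.where(condition, x, y)   # [1, 20, 3]
```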
---
## BufferValue
## `BufferValue` {#max.graph.BufferValue}
> class max.graph.BufferValue(value)
Bases: [`Value`](Value.md#max.graph.Value)\[`BufferType`]
Represents a mutable semantic tensor within a Graph.
**Parameters:**
value ([Value](Value.md#max.graph.Value)\[Any] | \_Value\[mo.BufferType] | HasBufferValue)
### `device` {#max.graph.BufferValue.device}
> property device: [DeviceRef](type.md#max.graph.type.DeviceRef)
Returns the device of the BufferValue.
### `dtype` {#max.graph.BufferValue.dtype}
> property dtype: [DType](../dtype.md#max.dtype.DType)
Returns the tensor data type.
### `from_mlir()` {#max.graph.BufferValue.from_mlir}
> classmethod from\_mlir(value)
Creates a [`BufferValue`](#max.graph.BufferValue) from an MLIR buffer value.
**Parameters:**
value (Value\[BufferType]) – The MLIR buffer value to wrap.
**Return type:**
[BufferValue](#max.graph.BufferValue)
### `print()` {#max.graph.BufferValue.print}
> print(label='debug\_buffer')
Prints detailed information about the buffer.
### `rank` {#max.graph.BufferValue.rank}
> property rank: [int](https://docs.python.org/3/library/functions.html#int)
Returns the rank (number of dims) of the buffer.
### `shape` {#max.graph.BufferValue.shape}
> property shape: [Shape](shape.md#max.graph.shape.Shape)
Returns the shape of the BufferValue.
### `type` {#max.graph.BufferValue.type}
> property type: [BufferType](type.md#max.graph.type.BufferType)
Returns the type of the [`BufferValue`](#max.graph.BufferValue) as a `BufferType`.
---
## Graph
## `Graph` {#max.graph.Graph}
> class max.graph.Graph(name, forward=None, input\_types=(), path=None, \*args, custom\_extensions=\[], kernel\_library=None, module=None, \*\*kwargs)
Represents a single MAX graph.
A Graph is a callable routine in MAX Engine. Like functions, graphs have a
name and signature. Unlike a function, which follows an imperative
programming model, a Graph follows a dataflow programming model, using
lazily-executed, parallel operations instead of sequential instructions.
When you instantiate a graph, you must specify the input shapes as one or
more `TensorType` values. Then, build a sequence of ops and set the
graph output with [`output()`](#max.graph.Graph.output). For example:
```python
from dataclasses import dataclass
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, TensorValue, ops
@dataclass
class Linear:
weight: np.ndarray
bias: np.ndarray
def __call__(self, x: TensorValue) -> TensorValue:
weight_tensor = ops.constant(self.weight, dtype=DType.float32, device=DeviceRef.CPU())
bias_tensor = ops.constant(self.bias, dtype=DType.float32, device=DeviceRef.CPU())
return ops.matmul(x, weight_tensor) + bias_tensor
linear_graph = Graph(
"linear",
Linear(np.ones((2, 2)), np.ones((2,))),
input_types=[TensorType(DType.float32, (2,))]
)
```
You can’t call a Graph directly from Python. You must compile it and
execute it with MAX Engine. For more detail, see the tutorial about how to
[build a graph with MAX
Graph](/max/tutorials/get-started-with-max-graph-in-python).
When creating a graph, a global sequence of chains is initialized and stored
in Graph.\_current\_chain. Every side-effecting op, e.g. buffer\_load,
store\_buffer, load\_slice\_buffer, store\_slice\_buffer, will use the current
chain to perform the op and update Graph.\_current\_chain with a new
chain. Currently, the input/output chains for mutable ops can be used at
most once. The goal of this design choice is to prevent data races.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – A name for the graph.
* forward ([Callable](ops.md#max.graph.ops.Callable)\[..., None | [Value](Value.md#max.graph.Value)\[Any] | Iterable\[[Value](Value.md#max.graph.Value)\[Any]]] | None) – The sequence of graph ops for the forward pass (inference).
* input\_types (Iterable\[[Type](type.md#max.graph.type.Type)\[Any]]) – The data type(s) for the input tensor(s).
* path (Path | None) – The path to a saved graph (internal use only).
* custom\_extensions (Iterable\[Path]) – The extensions to load for the model. Supports paths
to `.mojopkg` or `.mojo` sources with custom ops.
* kernel\_library ([KernelLibrary](KernelLibrary.md#max.graph.KernelLibrary) | None) – Optional pre-built kernel library to use. Defaults to
`None` (a new library is created from `custom_extensions` if
needed).
* module (mlir.Module | None) – Optional existing MLIR module (internal use only). Defaults to
`None`.
### `add_subgraph()` {#max.graph.Graph.add_subgraph}
> add\_subgraph(name, forward=None, input\_types=(), path=None, custom\_extensions=\[], devices=\[])
Creates and adds a subgraph to the current graph.
Creates a new [`Graph`](#max.graph.Graph) instance configured as a subgraph of the current
graph. The subgraph inherits the parent graph’s module and symbolic
parameters. A chain type is automatically appended to the input
types to enable proper operation sequencing within the subgraph.
The created subgraph is marked with special MLIR attributes to identify it
as a subgraph and is registered in the parent graph’s subgraph registry.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The name identifier for the subgraph.
* forward ([Callable](ops.md#max.graph.ops.Callable)\[\[...], None | [Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]] | None) – The optional callable that defines the sequence of operations
for the subgraph’s forward pass. If provided, the subgraph will be
built immediately using this callable.
* input\_types ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[Type](type.md#max.graph.type.Type)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The data types for the subgraph’s input tensors. A chain
type will be automatically added to these input types.
* path (Path | None) – The optional path to a saved subgraph definition to load from
disk instead of creating a new one.
* custom\_extensions ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[Path]) – The list of paths to custom operation libraries
to load for the subgraph. Supports `.mojopkg` files and Mojo
source directories.
* devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](type.md#max.graph.type.DeviceRef)]) – The list of devices this subgraph is meant to use.
**Return type:**
[Graph](#max.graph.Graph)
### `add_weight()` {#max.graph.Graph.add_weight}
> add\_weight(weight, force\_initial\_weight\_on\_host=True)
Adds a weight to the graph.
If the weight is in the graph already, return the existing value.
**Parameters:**
* weight ([Weight](Weight.md#max.graph.Weight)) – The weight to add to the graph.
* force\_initial\_weight\_on\_host ([bool](https://docs.python.org/3/library/functions.html#bool)) – If true, forces weights
to initially be allocated on host before being moved to
the indicated device. This is needed as a stopgap
until we have a more fleshed-out ownership model of
external constants.
**Returns:**
A [`TensorValue`](TensorValue.md#max.graph.TensorValue) that contains this weight.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If a weight with the same name already exists in the graph.
### `always_ready_chain` {#max.graph.Graph.always_ready_chain}
> property always\_ready\_chain: \_ChainValue
A graph-global, immutable chain that is always ready.
Created once per graph and never advanced/merged by the graph itself.
Use it for operations that are safe to schedule without threading
per-device ordering (e.g., host→device transfers for staging).
### `current` {#max.graph.Graph.current}
> current
### `device_chains` {#max.graph.Graph.device_chains}
> device\_chains: \_DeviceChainMap
### `inputs` {#max.graph.Graph.inputs}
> property inputs: [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]
The input values of the graph.
### `kernel_libraries_paths` {#max.graph.Graph.kernel_libraries_paths}
> property kernel\_libraries\_paths: [list](https://docs.python.org/3/library/stdtypes.html#list)\[Path]
Returns the list of extra kernel libraries paths for the custom ops.
### `output()` {#max.graph.Graph.output}
> output(\*outputs)
Sets the output nodes of the [`Graph`](#max.graph.Graph).
### `output_types` {#max.graph.Graph.output_types}
> property output\_types: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Type](type.md#max.graph.type.Type)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]
View of the types of the graph output terminator.
---
## KernelLibrary
## `KernelLibrary` {#max.graph.KernelLibrary}
> class max.graph.KernelLibrary(paths=())
Manages custom kernel libraries and operations for a graph.
A kernel library provides access to custom operations and kernels that can
be loaded from various sources including Mojo binary packages (`.mojopkg`)
and Mojo source directories. The library handles verification and registration
of custom operations within the MLIR context.
**Parameters:**
paths (Iterable\[Path])
### `add_path()` {#max.graph.KernelLibrary.add_path}
> add\_path(path)
Adds a kernel library path to the analysis.
**Parameters:**
path (Path) – The `Path` to the kernel library to be added to the
current analysis.
**Return type:**
None
### `library_paths()` {#max.graph.KernelLibrary.library_paths}
> library\_paths()
Returns the list of kernel library paths.
**Returns:**
A list of `Path` objects representing the currently loaded
kernel library paths.
### `load_paths()` {#max.graph.KernelLibrary.load_paths}
> load\_paths(custom\_extensions)
Loads custom operations from provided library paths.
Performs “smart” library loading logic for custom operation
libraries in additional formats. The loading logic supports the
following formats:
* Compiled Mojo binary packages with `.mojopkg` extension
* Mojo source directory with custom operations
The loaded libraries are added to the current kernel library.
**Parameters:**
custom\_extensions ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[Path]) – The file paths to the custom operation libraries.
**Return type:**
None
### `verify_custom_op()` {#max.graph.KernelLibrary.verify_custom_op}
> verify\_custom\_op(custom\_op)
Verifies that a custom operation is valid within the current context.
**Parameters:**
custom\_op (Operation) – The `mlir.Operation` to be verified against the
current kernel library analysis.
**Return type:**
None
---
## TensorValue
## `TensorValue` {#max.graph.TensorValue}
> class max.graph.TensorValue(value)
Bases: [`Value`](Value.md#max.graph.Value)\[`TensorType`]
Represents a value semantic tensor within a [`Graph`](Graph.md#max.graph.Graph). It provides
various methods and properties to manipulate and query tensor attributes
such as [`shape`](shape.md#module-max.graph.shape), data type ([`dtype`](#max.graph.TensorValue.dtype)), device placement ([`device`](#max.graph.TensorValue.device)), and more.
The following example demonstrates how to create and manipulate tensor values in a graph:
```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)
# Create a Graph context to work with tensors
with Graph("tensor_demo") as graph:
# Create a constant tensor from the matrix
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Access tensor properties
print(f"Shape: {tensor.shape}") # Output: [2, 2]
print(f"Data type: {tensor.dtype}") # Output: DType.float32
# Perform operations on the tensor
transposed = tensor.T
doubled = tensor * 2
print(f"Original shape: {tensor.shape}") # Output: [2, 2]
print(f"Transposed shape: {transposed.shape}") # Output: [2, 2]
```
**Parameters:**
value (TensorValueLike)
### `T` {#max.graph.TensorValue.T}
> property T: [TensorValue](#max.graph.TensorValue)
Returns the transposed tensor.
[`T`](#max.graph.TensorValue.T) is the shorthand notation for transposing.
For more information, see [`transpose()`](#max.graph.TensorValue.transpose).
**Returns:**
A new [`TensorValue`](#max.graph.TensorValue) with swapped dimensions.
### `argmax()` {#max.graph.TensorValue.argmax}
> argmax(axis=-1)
Reduces the tensor using an argmax operation along `axis`.
When the result is ambiguous (i.e., there are multiple maxima),
one index is selected arbitrarily.
```python
from max.dtype import DType
from max.graph import Graph, TensorType, DeviceRef
# Define a 2x3 float32 input tensor for the graph
input_type = TensorType(DType.float32, (2, 3), device=DeviceRef.CPU())
with Graph("argmax_demo", input_types=[input_type]) as graph:
x = graph.inputs[0].tensor
# Argmax along axis 1 (last dimension of each row)
indices = x.argmax(axis=1)
print(f"Input shape: {x.shape}") # [2, 3]
print(f"Argmax shape: {indices.shape}") # [2, 1]
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension (e.g., `-1` is the last dimension).
**Returns:**
A [`TensorValue`](#max.graph.TensorValue) of dtype `DType.int64` with the same rank as the input,
and the same shape except along `axis`, which will have size 1.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `broadcast_to()` {#max.graph.TensorValue.broadcast_to}
> broadcast\_to(shape)
Broadcasts the tensor to a new shape.
The following example demonstrates how to broadcast a tensor to a larger shape:
```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops
# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)
# Create a Graph context to work with tensors
with Graph("broadcast_to_demo") as graph:
# Create a constant tensor from the matrix
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Broadcast tensor to a 3x2x2 tensor (add a new dimension of size 3)
broadcasted_tensor = tensor.broadcast_to((3, 2, 2))
print(f"Original shape: {tensor.shape}") # Output: [2, 2]
print(f"Broadcasted shape: {broadcasted_tensor.shape}") # Output: [3, 2, 2]
```
**Parameters:**
shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – An iterable of integers or symbolic dimensions.
**Returns:**
A new [`TensorValue`](#max.graph.TensorValue) with the broadcasted shape.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `cast()` {#max.graph.TensorValue.cast}
> cast(dtype)
Casts a symbolic tensor to a different data type.
The following example demonstrates how to cast a tensor from one data type to another:
```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops
# Create a matrix with float32 values
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)
# Create a Graph context to work with tensors
with Graph("cast_demo") as graph:
# Create a constant tensor from the matrix
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Cast tensor to integer type
casted_tensor = tensor.cast(DType.int32)
print(f"Original dtype: {tensor.dtype}") # Output: DType.float32
print(f"Casted dtype: {casted_tensor.dtype}") # Output: DType.int32
```
**Parameters:**
dtype ([DType](../dtype.md#max.dtype.DType)) – The target data type (e.g., `DType.int32`, `DType.float64`).
**Returns:**
A new [`TensorValue`](#max.graph.TensorValue) with the casted data type.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `device` {#max.graph.TensorValue.device}
> property device: [DeviceRef](type.md#max.graph.type.DeviceRef)
Returns the device of the TensorValue.
### `dtype` {#max.graph.TensorValue.dtype}
> property dtype: [DType](../dtype.md#max.dtype.DType)
Returns the tensor data type.
The following example demonstrates how to access the data type of a tensor:
```python
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, ops
# Create a matrix with float32 values
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)
# Create a Graph context to work with tensors
with Graph("dtype_demo") as graph:
# Create a constant tensor from the matrix
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Access tensor data type
print(f"Data type: {tensor.dtype}") # Output: DType.float32
```
### `flatten()` {#max.graph.TensorValue.flatten}
> flatten(start\_dim=0, end\_dim=-1)
Flattens the specified dims of a symbolic tensor.
The number and order of the elements in the tensor is unchanged.
All dimensions from `start_dim` to `end_dim` (inclusive) are merged into a single output dim.
The following example demonstrates how to flatten a multi-dimensional tensor:
```python
import numpy as np
from max.dtype import DType
from max.graph import Graph, ops, DeviceRef
# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)
# Create a Graph context to work with tensors
with Graph("flatten_demo") as graph:
# Create a constant tensor from the matrix
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Flatten the tensor to a 1D array
flattened_tensor = tensor.flatten()
print(f"Original shape: {tensor.shape}") # Output: [2, 2]
print(f"Flattened shape: {flattened_tensor.shape}") # Output: [4]
```
**Parameters:**
* start\_dim ([int](https://docs.python.org/3/library/functions.html#int)) – The starting dimension to flatten. Defaults to `0`.
* end\_dim ([int](https://docs.python.org/3/library/functions.html#int)) – The ending dimension to flatten. Defaults to `-1`.
**Returns:**
A new [`TensorValue`](#max.graph.TensorValue) with the flattened dimensions.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `from_mlir()` {#max.graph.TensorValue.from_mlir}
> classmethod from\_mlir(value)
Creates a [`TensorValue`](#max.graph.TensorValue) from an MLIR tensor value.
**Parameters:**
value (Value\[TensorType]) – The MLIR tensor value to wrap.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `max()` {#max.graph.TensorValue.max}
> max(axis=-1)
Reduces the tensor using a max operation along `axis`.
```python
from max.dtype import DType
from max.graph import Graph, TensorType, DeviceRef
# Define a 2x3 float32 input tensor for the graph
input_type = TensorType(DType.float32, (2, 3), device=DeviceRef.CPU())
with Graph("max_demo", input_types=[input_type]) as graph:
x = graph.inputs[0].tensor
# Max along axis 1 (last dimension of each row)
m = x.max(axis=1)
print(f"Input shape: {x.shape}") # [2, 3]
print(f"Max shape: {m.shape}") # [2, 1]
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension (e.g., `-1` is the last dimension).
**Returns:**
A [`TensorValue`](#max.graph.TensorValue) with the same rank as the input and the same
shape except along `axis`, which will have size 1.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `mean()` {#max.graph.TensorValue.mean}
> mean(axis=-1)
Reduces the tensor using a mean operation along `axis`.
```python
from max.dtype import DType
from max.graph import Graph, TensorType, DeviceRef
# Define a 2x3 float32 input tensor for the graph
input_type = TensorType(DType.float32, (2, 3), device=DeviceRef.CPU())
with Graph("mean_demo", input_types=[input_type]) as graph:
x = graph.inputs[0].tensor
# Mean along axis 1 (last dimension of each row)
mu = x.mean(axis=1)
print(f"Input shape: {x.shape}") # [2, 3]
print(f"Mean shape: {mu.shape}") # [2, 1]
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension (e.g., `-1` is the last dimension).
**Returns:**
A [`TensorValue`](#max.graph.TensorValue) with the same rank as the input and the same
shape except along `axis`, which will have size 1.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `min()` {#max.graph.TensorValue.min}
> min(axis=-1)
Reduces the tensor using a min operation along `axis`.
```python
from max.dtype import DType
from max.graph import Graph, TensorType, DeviceRef
# Define a 2x3 float32 input tensor for the graph
input_type = TensorType(DType.float32, (2, 3), device=DeviceRef.CPU())
with Graph("min_demo", input_types=[input_type]) as graph:
x = graph.inputs[0].tensor
# Min along axis 1 (last dimension of each row)
mn = x.min(axis=1)
print(f"Input shape: {x.shape}") # [2, 3]
print(f"Min shape: {mn.shape}") # [2, 1]
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension (e.g., `-1` is the last dimension).
**Returns:**
A [`TensorValue`](#max.graph.TensorValue) with the same rank as the input and the same
shape except along `axis`, which will have size 1.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `permute()` {#max.graph.TensorValue.permute}
> permute(dims)
Permutes the tensor’s dimensions based on provided indices.
**Parameters:**
dims ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]) – A list of integers specifying the new order of dimensions.
**Returns:**
A new [`TensorValue`](#max.graph.TensorValue) with permuted dimensions.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `print()` {#max.graph.TensorValue.print}
> print(label='debug\_tensor')
Prints detailed information about the tensor.
**Parameters:**
label ([str](https://docs.python.org/3/library/stdtypes.html#str)) – A string label for the printed output. Defaults to `debug_tensor`.
**Return type:**
None
### `rank` {#max.graph.TensorValue.rank}
> property rank: [int](https://docs.python.org/3/library/functions.html#int)
Returns the rank (number of dims) of the buffer.
The following example demonstrates how to access the rank of a tensor:
```python
import numpy as np
from max.dtype import DType
from max.graph import Graph, ops, DeviceRef
# Create a 2x2 matrix (2-dimensional array)
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)
# Create a Graph context to work with tensors
with Graph("rank_demo") as graph:
# Create a constant tensor from the matrix
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Access tensor rank (number of dimensions)
print(f"Rank: {tensor.rank}") # Output: 2
```
### `rebind()` {#max.graph.TensorValue.rebind}
> rebind(shape, message='')
Rebinds the tensor to a new shape with error handling.
**Parameters:**
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The new shape as an iterable of integers or symbolic dimensions.
* message ([str](https://docs.python.org/3/library/stdtypes.html#str)) – (optional) A message for logging or debugging.
**Returns:**
A new [`TensorValue`](#max.graph.TensorValue) with the updated shape.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `reshape()` {#max.graph.TensorValue.reshape}
> reshape(shape)
Creates a new tensor with the same data but reshaped.
The following example demonstrates how to reshape a tensor to change its dimensions:
```python
import numpy as np
from max.dtype import DType
from max.graph import Graph, ops, DeviceRef
# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)
# Create a Graph context to work with tensors
with Graph("reshape_demo") as graph:
# Create a constant tensor from the matrix
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Reshape tensor to a 1x4 matrix
reshaped_tensor = tensor.reshape((1, 4))
print(f"Original shape: {tensor.shape}") # Output: [2, 2]
print(f"Reshaped shape: {reshaped_tensor.shape}") # Output: [1, 4]
```
**Parameters:**
shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The new shape as an iterable of integers or symbolic dimensions.
**Returns:**
A new [`TensorValue`](#max.graph.TensorValue) with the reshaped dimensions.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `shape` {#max.graph.TensorValue.shape}
> property shape: [Shape](shape.md#max.graph.shape.Shape)
Returns the shape of the [`TensorValue`](#max.graph.TensorValue).
The following example demonstrates how to access the shape of a tensor:
```python
import numpy as np
from max.dtype import DType
from max.graph import Graph, ops, DeviceRef
# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)
# Create a Graph context to work with tensors
with Graph("shape_demo") as graph:
# Create a constant tensor from the matrix
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Access tensor shape
print(f"Shape: {tensor.shape}") # Shape: [Dim(2), Dim(2)]
```
### `stdev()` {#max.graph.TensorValue.stdev}
> stdev(axis=-1)
Reduces the tensor using a standard deviation operation along `axis`.
The standard deviation is computed as the square root of the population
variance along the specified axis.
```python
from max.dtype import DType
from max.graph import Graph, TensorType, DeviceRef
# Define a 2x3 float32 input tensor for the graph
input_type = TensorType(DType.float32, (2, 3), device=DeviceRef.CPU())
with Graph("stdev_demo", input_types=[input_type]) as graph:
x = graph.inputs[0].tensor
# Standard deviation along axis 1 (last dimension of each row)
sd = x.stdev(axis=1)
print(f"Input shape: {x.shape}") # [2, 3]
print(f"Stdev shape: {sd.shape}") # [2, 1]
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension (e.g., `-1` is the last dimension).
**Returns:**
A [`TensorValue`](#max.graph.TensorValue) with the same rank as the input and the same
shape except along `axis`, which will have size 1.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `to()` {#max.graph.TensorValue.to}
> to(device)
Transfers the tensor to a specified device without mutation.
The following example demonstrates how to move a tensor from one device to another:
```python
import numpy as np
from max.dtype import DType
from max.graph import Graph, ops, DeviceRef
# Create a 2x2 matrix
matrix = np.array([[1, 2], [3, 4]], dtype=np.float32)
with Graph("to_device_example") as graph:
# Create a tensor on the default device
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Move the tensor to a GPU device
gpu_tensor = tensor.to(DeviceRef.GPU())
print(f"Original device: {tensor.device}") # Output depends on default device
print(f"New device: {gpu_tensor.device}") # Output: gpu:0
```
**Parameters:**
device ([DeviceRef](type.md#max.graph.type.DeviceRef)) – A `DeviceRef` object specifying the target device.
**Returns:**
A new [`TensorValue`](#max.graph.TensorValue) on the specified device.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `transpose()` {#max.graph.TensorValue.transpose}
> transpose(dim\_1, dim\_2)
Swaps two dimensions of the tensor.
The following example demonstrates how to transpose a tensor by swapping its dimensions:
```python
import numpy as np
from max.dtype import DType
from max.graph import Graph, ops, DeviceRef
# Create a 2x3 matrix
matrix = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
with Graph("transpose_demo") as graph:
tensor = ops.constant(matrix, dtype=DType.float32, device=DeviceRef.CPU())
# Transpose the tensor (swap dimensions 0 and 1)
transposed_tensor = tensor.transpose(dim_1=0, dim_2=1)
print(f"Original shape: {tensor.shape}") # Output: [2, 3]
print(f"Transposed shape: {transposed_tensor.shape}") # Output: [3, 2]
```
**Parameters:**
* dim\_1 ([int](https://docs.python.org/3/library/functions.html#int)) – The first dimension to swap.
* dim\_2 ([int](https://docs.python.org/3/library/functions.html#int)) – The second dimension to swap.
**Returns:**
A new [`TensorValue`](#max.graph.TensorValue) with swapped dimensions.
**Return type:**
[TensorValue](#max.graph.TensorValue)
### `type` {#max.graph.TensorValue.type}
> property type: [TensorType](type.md#max.graph.type.TensorType)
Returns the type of the [`TensorValue`](#max.graph.TensorValue) as a `TensorType`.
### `var()` {#max.graph.TensorValue.var}
> var(axis=-1)
Reduces the tensor using a variance operation along `axis`.
The variance is computed as the mean of squared deviations from the mean
(population variance, i.e., without Bessel’s correction) along the specified axis.
```python
from max.dtype import DType
from max.graph import Graph, TensorType, DeviceRef
# Define a 2x3 float32 input tensor for the graph
input_type = TensorType(DType.float32, (2, 3), device=DeviceRef.CPU())
with Graph("var_demo", input_types=[input_type]) as graph:
x = graph.inputs[0].tensor
# Variance along axis 1 (last dimension of each row)
vr = x.var(axis=1)
print(f"Input shape: {x.shape}") # [2, 3]
print(f"Var shape: {vr.shape}") # [2, 1]
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension (e.g., `-1` is the last dimension).
**Returns:**
A [`TensorValue`](#max.graph.TensorValue) with the same rank as the input and the same
shape except along `axis`, which will have size 1.
**Return type:**
[TensorValue](#max.graph.TensorValue)
---
## Value
## `Value` {#max.graph.Value}
> class max.graph.Value
Represents a symbolic value within a Graph.
A Value can represent the output of a node, the arguments of a
Graph (as seen from within its body), and more generally any symbolic
value available within the Graph. Other nodes receive Values
as inputs to form a computation graph.
A Value may also refer to an existing input or output of a node,
and you can change it, for example by swapping in a new Value.
Conceptually, think of a Value as an edge in the dataflow graph,
with the other end being the user of that value.
The following example shows how to work with Values in a graph to create a simple computation:
```python
from max.graph import Graph, ops, Value, DeviceRef
from max.dtype import DType
import numpy as np
with Graph("value_example") as graph:
# Create input values
a = ops.constant(np.array([1, 2, 3]), dtype=DType.float32, device=DeviceRef.CPU())
b = ops.constant(np.array([4, 5, 6]), dtype=DType.float32, device=DeviceRef.CPU())
# Use values to perform operations
c = a + b # c is a Value representing the addition
# Demonstrate that the result is a Value
print(f"Type of c: {type(c)}")
print(f"Is c a Value? {isinstance(c, Value)}")
```
Similar to a regular variable, a Value has a data type.
### `buffer` {#max.graph.Value.buffer}
> property buffer: [BufferValue](BufferValue.md#max.graph.BufferValue)
Returns the Value as a [`BufferValue`](BufferValue.md#max.graph.BufferValue).
Raises an exception if the Value is not a BufferValue.
### `from_mlir()` {#max.graph.Value.from_mlir}
> classmethod from\_mlir(value)
Creates a [`Value`](#max.graph.Value) from an MLIR value.
**Parameters:**
value (Value\[MlirType]) – The MLIR value to wrap.
### `opaque` {#max.graph.Value.opaque}
> property opaque: \_OpaqueValue
Returns the Value as an `_OpaqueValue`.
Raises an exception if the Value is not a \_OpaqueValue.
### `tensor` {#max.graph.Value.tensor}
> property tensor: [TensorValue](TensorValue.md#max.graph.TensorValue)
Returns the Value as a [`TensorValue`](TensorValue.md#max.graph.TensorValue).
Raises an exception if the Value is not a TensorValue.
### `to_mlir()` {#max.graph.Value.to_mlir}
> to\_mlir()
Converts the [`Value`](#max.graph.Value) to an MLIR value.
**Return type:**
Value\[MlirType]
### `type` {#max.graph.Value.type}
> property type: [Type](type.md#max.graph.type.Type)\[MlirType]
Returns the type of the [`Value`](#max.graph.Value) as a `Type`.
---
## Weight
## `Weight` {#max.graph.Weight}
> class max.graph.Weight(\*args, \*\*kwargs)
Bases: [`TensorValue`](TensorValue.md#max.graph.TensorValue)
Represents a value in a Graph that can be loaded at a later time.
Weights can be initialized outside of a Graph and are lazily added to
the parent graph when used. If there is no parent graph when a weight is
used, an error will be raised.
### `align` {#max.graph.Weight.align}
> align: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
### `device` {#max.graph.Weight.device}
> property device: [DeviceRef](type.md#max.graph.type.DeviceRef)
The device where the weight resides.
### `dtype` {#max.graph.Weight.dtype}
> property dtype: [DType](../dtype.md#max.dtype.DType)
The data type of the weight.
### `original_dtype_and_shape` {#max.graph.Weight.original_dtype_and_shape}
> property original\_dtype\_and\_shape: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[DType](../dtype.md#max.dtype.DType), [Shape](shape.md#max.graph.shape.Shape)]
The original dtype and shape of this weight.
This property stores the weight’s original dtype and shape when a
quantization encoding forces the weight to be loaded as uint8.
### `quantization_encoding` {#max.graph.Weight.quantization_encoding}
> quantization\_encoding: [QuantizationEncoding](quantization.md#max.graph.quantization.QuantizationEncoding) | [None](https://docs.python.org/3/library/constants.html#None)
### `shape` {#max.graph.Weight.shape}
> property shape: [Shape](shape.md#max.graph.shape.Shape)
The shape of the weight.
For sharded weights, returns the shape of the shard. Otherwise,
returns the original weight shape.
### `shard()` {#max.graph.Weight.shard}
> shard(devices)
Creates sharded views of this Weight across multiple devices.
This Weight must have a `sharding_strategy` defined. The returned shards
are also Weight objects, but they cannot be sharded further.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](type.md#max.graph.type.DeviceRef)]) – Iterable of devices to place the shards on.
### `shard_idx` {#max.graph.Weight.shard_idx}
> shard\_idx: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
### `sharding_strategy` {#max.graph.Weight.sharding_strategy}
> property sharding\_strategy: ShardingStrategy | [None](https://docs.python.org/3/library/constants.html#None)
Gets the weight sharding strategy.
---
## dim
Library for graph dimension types.
## `AlgebraicDim` {#max.graph.dim.AlgebraicDim}
> class max.graph.dim.AlgebraicDim(value)
An algebraic tensor dimension to enable expressions over symbolic
dimensions.
That is, any expression over a symbolic dimension returns `AlgebraicDim`.
Furthermore, algebraic dimensions automatically simplify into a canonical
form.
The following example demonstrates how to create and use algebraic dimensions with symbolic values:
```python
from max.graph import AlgebraicDim, Dim
isinstance(Dim("batch") * 5, AlgebraicDim) # Returns True
print(Dim("batch") * 5) # Outputs: batch * 5
-Dim("x") - 4 == -(Dim("x") + 4) # Returns True
```
### `attr` {#max.graph.dim.AlgebraicDim.attr}
> attr: ParamOperatorAttr
### `from_mlir()` {#max.graph.dim.AlgebraicDim.from_mlir}
> static from\_mlir(attr)
Constructs a dimension from an `mlir.Attribute`.
**Parameters:**
attr (TypedAttr) – The MLIR attribute to parse into a dimension.
**Returns:**
The dimension represented by the MLIR Attr value.
**Return type:**
[Dim](#max.graph.dim.Dim)
### `parameters` {#max.graph.dim.AlgebraicDim.parameters}
> property parameters: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[SymbolicDim](#max.graph.dim.SymbolicDim)]
Lists the symbolic dimension names on which this dim depends.
### `to_mlir()` {#max.graph.dim.AlgebraicDim.to_mlir}
> to\_mlir()
Creates an mlir.Attribute representing this dimension.
This is used internally when constructing tensor MLIR types.
**Returns:**
An mlir.Attribute in the context representing the dimension.
**Return type:**
ParamOperatorAttr
## `Dim` {#max.graph.dim.Dim}
> class max.graph.dim.Dim(value)
A tensor dimension.
Tensor dimensions can be one of three types:
* **Static**: Known size
* **Symbolic**: Unknown size but named
* **Algebraic**: Unknown size defined by an algebraic expression
In most cases, you don’t need to work with a `Dim` directly.
Instead, use conversion constructors:
```python
from max.graph import Dim, TensorType, DeviceRef
from max.dtype import DType
tensor_type = TensorType(DType.int64, ("batch", 10), device=DeviceRef.CPU())
```
This creates a tensor type with two dimensions:
* A symbolic “batch” dimension
* A static dimension of size 10
For explicit dimension construction, use the following helpers:
```python
from max.graph import Dim, SymbolicDim, StaticDim, AlgebraicDim
some_dims = [
SymbolicDim("batch"),
StaticDim(5),
AlgebraicDim(Dim("batch") + 1),
]
```
Constraining tensor dimensions is one important way to improve model
performance. If tensors have unknown dimensions, we can’t optimize them
as aggressively. Symbolic tensors allow the compiler to learn constraints
on a specific dimension (e.g., if two inputs have the same batch dimension),
but static dims are the easiest to optimize and therefore the easiest to
create and work with.
**Parameters:**
value (DimLike)
### `from_mlir()` {#max.graph.dim.Dim.from_mlir}
> static from\_mlir(attr)
Constructs a dimension from an `mlir.Attribute`.
**Parameters:**
attr (TypedAttr) – The MLIR attribute to parse into a dimension.
**Returns:**
The dimension represented by the MLIR Attr value.
**Return type:**
[Dim](#max.graph.dim.Dim)
### `parameters` {#max.graph.dim.Dim.parameters}
> property parameters: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[SymbolicDim](#max.graph.dim.SymbolicDim)]
Lists the symbolic dimension names on which this dim depends.
### `to_mlir()` {#max.graph.dim.Dim.to_mlir}
> to\_mlir()
Creates an `mlir.Attribute` representing this dimension.
This is used internally when constructing tensor MLIR types.
**Returns:**
An `mlir.Attribute` in the context representing the dimension.
**Return type:**
TypedAttr
## `StaticDim` {#max.graph.dim.StaticDim}
> class max.graph.dim.StaticDim(value)
A static tensor dimension.
Static tensor dimensions will always have exactly the same value,
and are key to good model performance.
The following example shows how static dimensions can be created implicitly:
```python
from max.graph import TensorType, DeviceRef
from max.dtype import DType
tensor = TensorType(DType.int64, (4, 5), device=DeviceRef.CPU())
```
**Parameters:**
dim ([int](https://docs.python.org/3/library/functions.html#int))
### `dim` {#max.graph.dim.StaticDim.dim}
> dim: [int](https://docs.python.org/3/library/functions.html#int)
The size of the static dimension.
### `from_mlir()` {#max.graph.dim.StaticDim.from_mlir}
> static from\_mlir(attr)
Constructs a dimension from an `mlir.Attribute`.
**Parameters:**
attr (TypedAttr) – The MLIR attribute to parse into a dimension.
**Returns:**
The dimension represented by the MLIR Attr value.
**Return type:**
[Dim](#max.graph.dim.Dim)
### `parameters` {#max.graph.dim.StaticDim.parameters}
> property parameters: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[SymbolicDim](#max.graph.dim.SymbolicDim)]
Lists the symbolic dimension names on which this dim depends.
### `to_mlir()` {#max.graph.dim.StaticDim.to_mlir}
> to\_mlir()
Creates an `mlir.Attribute` representing this dimension.
This is used internally when constructing tensor MLIR types.
**Returns:**
An `mlir.Attribute` in the context representing the dimension.
**Return type:**
IntegerAttr
## `SymbolicDim` {#max.graph.dim.SymbolicDim}
> class max.graph.dim.SymbolicDim(value)
A symbolic tensor dimension.
Symbolic dimensions represent named dimensions in MO tensor types.
Symbolic dimensions don’t have a static value, but they give a dimension a
readable name that makes the model IR easier to understand, and they let
users hint to the compiler that two dimensions will have the same value,
which often enables important speedups.
In tensor type notation:
```default
!mo.tensor<[batch, x, 10], si32>
```
The first and second dimensions are named `batch` and `x` respectively.
Creating a `SymbolicDim`:
```python
from max.graph import SymbolicDim
dim = SymbolicDim("name")
```
Using `SymbolicDim` in a `TensorType`:
```python
from max.graph import TensorType, SymbolicDim, DeviceRef
from max.dtype import DType
tensor_type = TensorType(DType.bool, (SymbolicDim("batch"), SymbolicDim("x"), 10), device=DeviceRef.CPU())
```
**Parameters:**
name ([str](https://docs.python.org/3/library/stdtypes.html#str))
### `from_mlir()` {#max.graph.dim.SymbolicDim.from_mlir}
> static from\_mlir(attr)
Constructs a dimension from an `mlir.Attribute`.
**Parameters:**
attr (TypedAttr) – The MLIR attribute to parse into a dimension.
**Returns:**
The dimension represented by the MLIR Attr value.
**Return type:**
[Dim](#max.graph.dim.Dim)
### `name` {#max.graph.dim.SymbolicDim.name}
> name: [str](https://docs.python.org/3/library/stdtypes.html#str)
The name of the dimension.
### `parameters` {#max.graph.dim.SymbolicDim.parameters}
> property parameters: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[SymbolicDim](#max.graph.dim.SymbolicDim)]
Lists the symbolic dimension names on which this dim depends.
### `to_mlir()` {#max.graph.dim.SymbolicDim.to_mlir}
> to\_mlir()
Creates an `mlir.Attribute` representing this dimension.
This is used internally when constructing tensor MLIR types.
**Returns:**
An `mlir.Attribute` in the context representing the dimension.
**Return type:**
ParamDeclRefAttr
---
## graph (Graph)
APIs to build inference graphs for MAX Engine with Python.
## Classes
* [`BufferValue`](/max/api/python/graph/BufferValue): Represents a mutable semantic tensor within a Graph.
* [`Graph`](/max/api/python/graph/Graph): Represents a graph for MAX Engine.
* [`KernelLibrary`](/max/api/python/graph/KernelLibrary): Represents a library with custom ops.
* [`TensorValue`](/max/api/python/graph/TensorValue): Represents a value semantic tensor within a Graph.
* [`Value`](/max/api/python/graph/Value): Represents a symbolic value within a Graph.
* [`Weight`](/max/api/python/graph/Weight): Represents a weight value in a graph.
## Modules
* [`dim`](/max/api/python/graph/dim): APIs for graph value tensor dimensions.
* [`ops`](/max/api/python/graph/ops): Ops you can add when staging a graph.
* [`quantization`](/max/api/python/graph/quantization): APIs to quantize graph tensors.
* [`shape`](/max/api/python/graph/shape): APIs for graph value tensor shapes.
* [`type`](/max/api/python/graph/type): APIs for graph value types.
* [`weights`](/max/api/python/graph/weights): APIs for loading weights into a graph.
---
## ops
Implements operations used when staging a graph.
This module provides operations for building computational graphs in MAX. These
operations create, transform, and manipulate tensor values within the graph.
You can also use functions in [Graph](/max/api/python/graph/Graph) to add
constant values to your graph with operations like
[constant()](/max/api/python/graph/ops#max.graph.ops.constant).
The [TensorValue](/max/api/python/graph/TensorValue/) type (returned by most
operations) implements various dunder methods to support operations between
TensorValues, such as + for addition, \* for multiplication, and @ for
matrix multiplication. It also provides convenience methods like
[reshape()](/max/api/python/graph/TensorValue/#max.graph.TensorValue.reshape)
and
[flatten()](/max/api/python/graph/TensorValue/#max.graph.TensorValue.flatten).
### `Callable` {#max.graph.ops.Callable}
> class max.graph.ops.Callable
### `DeviceRef` {#max.graph.ops.DeviceRef}
> class max.graph.ops.DeviceRef(device\_type, id=0)
A symbolic device representation.
A DeviceRef consists of a DeviceKind and an id, and directly represents
the device attribute in MLIR.
The following example demonstrates how to create and use device references:
```python
from max.graph import DeviceRef
gpu_device = DeviceRef.GPU()
print(gpu_device) # Outputs: gpu:0
# Create a CPU device with a specific id
cpu_device = DeviceRef.CPU(id=1)
print(cpu_device) # Outputs: cpu:1
```
**Parameters:**
* device\_type ([DeviceKind](type.md#max.graph.type.DeviceKind))
* id ([int](https://docs.python.org/3/library/functions.html#int))
#### `CPU()` {#max.graph.ops.DeviceRef.CPU}
> static CPU(id=0)
Static method for creating a CPU device.
**Parameters:**
id ([int](https://docs.python.org/3/library/functions.html#int))
**Return type:**
[DeviceRef](type.md#max.graph.type.DeviceRef)
#### `GPU()` {#max.graph.ops.DeviceRef.GPU}
> static GPU(id=0)
Static method for creating a GPU device.
**Parameters:**
id ([int](https://docs.python.org/3/library/functions.html#int))
#### `from_mlir()` {#max.graph.ops.DeviceRef.from_mlir}
> static from\_mlir(attr)
Returns a device from an MLIR attribute.
**Parameters:**
attr (DeviceRefAttr)
**Return type:**
[DeviceRef](type.md#max.graph.type.DeviceRef)
#### `id` {#max.graph.ops.DeviceRef.id}
> id: [int](https://docs.python.org/3/library/functions.html#int)
#### `is_cpu()` {#max.graph.ops.DeviceRef.is_cpu}
> is\_cpu()
Returns true if the device is a CPU device.
### `InterpolationMode` {#max.graph.ops.InterpolationMode}
> class max.graph.ops.InterpolationMode(value, names=None, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
Interpolation modes for image resize operations.
This enum defines the available interpolation methods that can be used
when resizing tensors. Currently only BICUBIC is implemented, with
BILINEAR and NEAREST planned for future support.
#### `BICUBIC` {#max.graph.ops.InterpolationMode.BICUBIC}
> BICUBIC = 'bicubic'
#### `BILINEAR` {#max.graph.ops.InterpolationMode.BILINEAR}
> BILINEAR = 'bilinear'
#### `NEAREST` {#max.graph.ops.InterpolationMode.NEAREST}
> NEAREST = 'nearest'
### `TensorType` {#max.graph.ops.TensorType}
> class max.graph.ops.TensorType(dtype, shape, device, \_layout=None)
A symbolic [`TensorType`](#max.graph.ops.TensorType).
This is not an eager tensor type! This contains no actual data, but
instead represents the type of a value at some point in time during model
execution.
Most internal values in a model will be tensors. This type represents
their element type (`dtype`) and dimensions (`dims`) at a specific point during
model computation. It allows us to do some optimistic optimizations and
shape inference during graph construction, and to provide more detailed
shape information to the compiler for further optimization passes.
The following example shows how to create a tensor type with static dimensions and access its properties:
```python
from max.graph import TensorType, DeviceRef
from max.dtype import DType
# Create a tensor type with float32 elements and static dimensions 2x3
tensor_type = TensorType(DType.float32, (2, 3), device=DeviceRef.CPU())
print(tensor_type.dtype) # Outputs: DType.float32
print(tensor_type.shape) # Outputs: [2, 3]
```
It can also represent a fully dynamic rank tensor. The presence of dynamic
rank tensors in a graph will often degrade performance dramatically and
prevent many classes of optimizations.
An optional device (`device`) can also be provided to indicate the explicit
device the tensor is associated with.
### `acos()` {#max.graph.ops.acos}
> max.graph.ops.acos(x)
Computes the arccosine (inverse cosine) of the input tensor.
Returns values in the range \[0, π] for inputs in \[-1, 1].
Creates a new op node to compute the elementwise arccosine of a
symbolic tensor and adds it to the graph, returning the symbolic result.
```python
from max.dtype import DType
from max.graph import Graph, TensorType, DeviceRef, ops

def acos_graph():
input_type = TensorType(dtype=DType.float32, shape=(3,), device=DeviceRef.CPU())
with Graph("acos_graph", input_types=(input_type,)) as graph:
x = graph.inputs[0]
out = ops.acos(x)
graph.output(out)
```
**Parameters:**
x ([TensorValue](TensorValue.md#max.graph.TensorValue)) – Input tensor with values in \[-1, 1]. If values are outside this
domain, they will be clamped to the valid range.
**Returns:**
Arccosine of the input in radians \[0, π]. The result will have
the same dtype and the same shape as the input.
**Raises:**
* Error – If the symbol doesn’t represent a tensor value.
* Error – If the input is not a floating-point dtype.
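The documented clamping of out-of-domain inputs can be illustrated with plain NumPy (a sketch of the semantics only, not the graph op itself; `acos_clamped` is a hypothetical helper):

```python
import numpy as np

def acos_clamped(x):
    # Clamp to the valid domain [-1, 1] before taking the arccosine,
    # mirroring the documented handling of out-of-domain values.
    return np.arccos(np.clip(x, -1.0, 1.0))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0], dtype=np.float32)
y = acos_clamped(x)
# acos(-2) is clamped to acos(-1) = pi; acos(2) is clamped to acos(1) = 0
```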
### `allgather()` {#max.graph.ops.allgather}
> max.graph.ops.allgather(inputs, signal\_buffers, axis=0)
Collective allgather operation.
This is a collective op: it takes input tensors from different devices and
produces output tensors on those same devices.
In particular, this operation gathers the inputs across the different
devices and concatenates them along the specified dimension.
The result is then broadcast back to the same devices that the inputs
came from.
**Parameters:**
* inputs ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)]) – The input tensors to gather.
* signal\_buffers ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[BufferValue](BufferValue.md#max.graph.BufferValue) | HasBufferValue]) – Device buffer values used for synchronization.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – Dimension to concatenate the input tensors. Defaults to 0.
**Returns:**
An iterable of outputs, one per input device, each holding the gathered result. Each output
tensor contains the concatenation of all inputs along the specified dimension.
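Setting aside devices and synchronization, the data movement amounts to concatenating all shards and handing each participant a copy of the result. A NumPy sketch (the `allgather_sketch` helper is hypothetical, not part of the API):

```python
import numpy as np

def allgather_sketch(inputs, axis=0):
    # Concatenate every per-device shard along `axis`, then return one
    # copy of the gathered result per participating input.
    gathered = np.concatenate(list(inputs), axis=axis)
    return [gathered.copy() for _ in inputs]

shards = [np.array([[1, 2]]), np.array([[3, 4]]), np.array([[5, 6]])]
outs = allgather_sketch(shards, axis=0)
# Each of the three outputs holds the full (3, 2) concatenation
```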
### `argmax()` {#max.graph.ops.argmax}
> max.graph.ops.argmax(x, axis=-1)
Reduces a symbolic tensor using an argmax operation.
For a tensor whose elements are all identical, this returns the index of
the first element on CPU, and an arbitrary index on GPU.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor for the operation.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension. For example, a value of -1 will
compute the reduction along the last dimension.
**Returns:**
A symbolic tensor representing the result of the argmax operation.
The tensor will have the same rank as the input tensor, and the same
shape except along the `axis` dimension which will have size 1.
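The shape contract above (rank preserved, reduced axis of size 1) matches NumPy's argmax with the reduced axis explicitly kept:

```python
import numpy as np

x = np.array([[1, 5, 3],
              [9, 2, 9]])
# Keeping the reduced axis as size 1 preserves the input rank,
# matching the shape contract described above.
idx = np.expand_dims(np.argmax(x, axis=-1), axis=-1)
```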
### `argmin()` {#max.graph.ops.argmin}
> max.graph.ops.argmin(x, axis=-1)
Reduces a symbolic tensor using an argmin operation.
For a tensor whose elements are all identical, this returns the index of
the first element on CPU, and an arbitrary index on GPU.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor for the operation.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension. For example, a value of -1 will
compute the reduction along the last dimension.
**Returns:**
A symbolic tensor representing the result of the argmin operation.
The tensor will have the same rank as the input tensor, and the same
shape except along the `axis` dimension which will have size 1.
### `argsort()` {#max.graph.ops.argsort}
> max.graph.ops.argsort(x, ascending=True)
Returns the indices that would sort a tensor.
This function returns the indices that would sort the input tensor along
its first dimension. The returned indices are of type int64.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue) – Input tensor to be sorted.
* ascending ([bool](https://docs.python.org/3/library/functions.html#bool)) – If True (default), sort in ascending order. If False, sort in
descending order.
**Returns:**
A tensor of indices of the same shape as the input tensor.
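The same index semantics can be demonstrated with NumPy. Descending order is obtained here by negating the input, one of several equivalent tricks:

```python
import numpy as np

x = np.array([3.0, 1.0, 2.0])
asc = np.argsort(x)    # indices that would sort x ascending
desc = np.argsort(-x)  # indices that would sort x descending
```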
### `as_interleaved_complex()` {#max.graph.ops.as_interleaved_complex}
> max.graph.ops.as\_interleaved\_complex(x)
Reshapes the input symbolic tensor as complex from alternating (real, imag).
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – A symbolic tensor representing complex numbers as
alternating pairs of (real, imag) real-valued numbers. Its last
dimension must have an even size.
**Returns:**
A symbolic tensor with the same values regrouped as complex numbers. The
result has the same dimensions as the input for all but the last
dimension, which is halved, followed by a final dimension of size 2
holding the (real, imag) components.
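The reshape this op performs can be sketched in NumPy (the `as_interleaved_complex_sketch` name is illustrative, not part of the API):

```python
import numpy as np

def as_interleaved_complex_sketch(x):
    # The last dimension holds alternating (real, imag) pairs: halve it
    # and add a trailing dimension of size 2 carrying each pair.
    assert x.shape[-1] % 2 == 0, "last dimension must have an even size"
    return x.reshape(*x.shape[:-1], x.shape[-1] // 2, 2)

x = np.arange(12.0).reshape(3, 4)  # last dim: r0, i0, r1, i1
out = as_interleaved_complex_sketch(x)
```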
### `avg_pool2d()` {#max.graph.ops.avg_pool2d}
> max.graph.ops.avg\_pool2d(input, kernel\_size, stride=1, dilation=1, padding=0, ceil\_mode=False, count\_boundary=True)
Performs a 2D average pooling operation on the input tensor.
This function applies a 2D average pooling operation to the input tensor \[N, H, W, C].
The pooling operation slides a window of size kernel\_size over the input
tensor, and computes the average value within each window.
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor to perform the pooling operation on.
* kernel\_size ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)], [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The size of the sliding blocks.
* stride ([int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The stride of the sliding blocks in the input dimension.
* dilation ([int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The spacing between the kernel elements.
* padding ([int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – Zero-padding added on both sides of the input.
* ceil\_mode ([bool](https://docs.python.org/3/library/functions.html#bool)) – If true, use ceil instead of floor to compute the output shape.
* count\_boundary ([bool](https://docs.python.org/3/library/functions.html#bool)) – If true, count the padding elements when computing the average.
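For the simplified case of non-overlapping windows (stride equal to kernel size, no padding), the computation reduces to a reshape and a mean. A NumPy sketch on the \[N, H, W, C] layout, with a hypothetical `avg_pool2d_sketch` helper:

```python
import numpy as np

def avg_pool2d_sketch(x, k):
    # Non-overlapping k x k average pooling on an [N, H, W, C] tensor.
    # Assumes stride == kernel size and no padding (a simplification).
    n, h, w, c = x.shape
    return x.reshape(n, h // k, k, w // k, k, c).mean(axis=(2, 4))

x = np.arange(16.0).reshape(1, 4, 4, 1)
out = avg_pool2d_sketch(x, 2)
# Top-left 2x2 block is [0, 1, 4, 5], whose average is 2.5
```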
### `band_part()` {#max.graph.ops.band_part}
> max.graph.ops.band\_part(x, num\_lower=None, num\_upper=None, exclude=False)
Masks out everything except a diagonal band of an input matrix.
Copies a tensor setting everything outside the central diagonal band of the
matrices to zero, where all but the last two axes are effectively batches,
and the last two axes define sub matrices.
Assumes the input has dimensions \[I, J, …, M, N], then the output tensor
has the same shape as the input, and the values are given by
```python
out[i, j, ..., m, n] = in_band(m, n) * input[i, j, ..., m, n]
```
with the indicator function:
```python
in_band(m, n) = (num_lower is None || (m - n) <= num_lower) &&
                (num_upper is None || (n - m) <= num_upper)
```
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input to mask out.
* num\_lower ([int](https://docs.python.org/3/library/functions.html#int) | None) – The number of diagonal bands to include below the central
diagonal. If None, include the entire lower triangle.
* num\_upper ([int](https://docs.python.org/3/library/functions.html#int) | None) – The number of diagonal bands to include above the central
diagonal. If None, include the entire upper triangle.
* exclude ([bool](https://docs.python.org/3/library/functions.html#bool)) – If true, invert the selection of elements to mask. Elements
in the band are set to zero.
**Returns:**
A symbolic tensor value with the configured selection masked out
to 0 values, and the remaining values copied from the input tensor.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the input tensor rank is less than 2, or if num\_lower/num\_upper
are out of bounds for statically known dimensions.
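The indicator function above can be implemented directly. This NumPy sketch (the `band_part_sketch` helper is hypothetical) reproduces the documented behavior on the last two axes of a single matrix:

```python
import numpy as np

def band_part_sketch(x, num_lower=None, num_upper=None, exclude=False):
    m = np.arange(x.shape[-2])[:, None]  # row indices
    n = np.arange(x.shape[-1])[None, :]  # column indices
    # in_band(m, n) from the formula above; None means "no limit".
    lower_ok = np.ones((x.shape[-2], x.shape[-1]), bool) if num_lower is None else (m - n) <= num_lower
    upper_ok = np.ones((x.shape[-2], x.shape[-1]), bool) if num_upper is None else (n - m) <= num_upper
    in_band = lower_ok & upper_ok
    if exclude:
        in_band = ~in_band
    return np.where(in_band, x, 0)

x = np.arange(16).reshape(4, 4)
# band_part_sketch(x, 0, None) keeps the upper triangle, like np.triu
```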
### `broadcast_to()` {#max.graph.ops.broadcast_to}
> max.graph.ops.broadcast\_to(x, shape, out\_dims=None)
Broadcasts a symbolic tensor.
Broadcasts the input tensor to the specified shape.
Dimensions in the input must be one or match the target dimension.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue) – The input symbolic tensor to broadcast.
This tensor may not contain any dynamic dimensions.
* shape ([TensorValue](TensorValue.md#max.graph.TensorValue) | [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The new shape as a list of dimensions.
Dynamic dimensions are not allowed.
* out\_dims ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]] | None) – Output dims used only for tensor-valued shape.
**Returns:**
A symbolic tensor with the same elements as the original tensor, but
in a new shape. Its symbolic shape is the same as `shape`.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – if a tensor-valued shape is passed without out\_dims.
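The broadcast rule above (size-1 dimensions stretch to the target size, all others must match) is the same as NumPy's:

```python
import numpy as np

x = np.array([[1], [2]])  # shape (2, 1)
# The size-1 second dimension expands to 3; the first must match exactly.
y = np.broadcast_to(x, (2, 3))
```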
### `buffer_load()` {#max.graph.ops.buffer_load}
> max.graph.ops.buffer\_load(x)
Loads the input buffer into a tensor.
It copies the in-place mutable buffer into an immutable tensor graph value.
This is semantically equivalent to a copy from the mutable buffer x to the
immutable, value-semantic tensor output.
**Parameters:**
x ([BufferValue](BufferValue.md#max.graph.BufferValue)) – The buffer to be loaded to a tensor.
**Returns:**
A tensor graph value representing a copy of the buffer loaded.
### `buffer_store()` {#max.graph.ops.buffer_store}
> max.graph.ops.buffer\_store(destination, source)
Stores the input tensor into the in-out buffer.
It stores the immutable input tensor source in the mutable buffer destination.
This is semantically equivalent to a copy from the source tensor to the destination buffer.
**Parameters:**
* destination ([BufferValue](BufferValue.md#max.graph.BufferValue) | HasBufferValue) – The buffer to store the tensor in.
* source (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The tensor to be stored in the buffer.
**Return type:**
None
### `buffer_store_slice()` {#max.graph.ops.buffer_store_slice}
> max.graph.ops.buffer\_store\_slice(destination, source, indices)
Stores the input tensor into a slice of the input buffer.
It stores the immutable input tensor source in the mutable tensor destination.
This is semantically equivalent to a copy from the source tensor to a slice of the
destination buffer at the index specified by indices.
**Parameters:**
* destination ([BufferValue](BufferValue.md#max.graph.BufferValue) | HasBufferValue) – The buffer to store the tensor in.
* source (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The tensor to be stored in the buffer.
* indices ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[TensorValue](TensorValue.md#max.graph.TensorValue) | [int](https://docs.python.org/3/library/functions.html#int) | [slice](https://docs.python.org/3/library/functions.html#slice) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[slice](https://docs.python.org/3/library/functions.html#slice), [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]] | builtins.ellipsis]) – The index in the buffer where the tensor should be stored
**Return type:**
None
### `call()` {#max.graph.ops.call}
> max.graph.ops.call(graph, \*args, prefix='')
Call a graph with the provided arguments and return its results.
This function invokes a previously defined graph, passing in the provided
arguments and the current chain value, and returns the results.
The body of the graph is ultimately inlined into the caller, so the chain
value is only used for serialization if the subgraph’s body contains an
operation that makes use of it in the first place.
The current advantage of using subgraphs is that it offers a way to improve
compile times for operations that are used repeatedly in a model. As a
secondary benefit, it also makes the IR more readable by allowing control
flow to be expressed in a more natural way.
**Parameters:**
* graph ([Graph](Graph.md#max.graph.Graph)) – The graph to call
* \*args ([Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – Arguments to pass to the called graph
* prefix ([str](https://docs.python.org/3/library/stdtypes.html#str)) – Prefix to add to the names of any weights in the subgraph
**Returns:**
Either a single Value or a list of Values representing the graph outputs
(excluding the chain value which is handled internally)
### `cast()` {#max.graph.ops.cast}
> max.graph.ops.cast(x, dtype)
Casts a symbolic tensor to a different data type.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue) – The input tensor to cast.
* dtype ([DType](../dtype.md#max.dtype.DType)) – The target dtype to which the tensor is cast.
**Returns:**
A new symbolic tensor with the same shape as the input and the
specified dtype.
### `chunk()` {#max.graph.ops.chunk}
> max.graph.ops.chunk(x, chunks, axis=0)
Chunk the tensor into an exact number of chunks along the specified dim.
**Example:**
```pycon
>>> a = TensorValue([1, 2, 3, 4])
>>> chunk(a, 2, 0)
[TensorValue([1, 2]), TensorValue([3, 4])]
```
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The tensor to chunk.
* chunks ([int](https://docs.python.org/3/library/functions.html#int)) – The number of chunks to split the tensor into.
chunks must statically evenly divide x.shape\[axis].
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis to split the tensor along.
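The even-division requirement mirrors NumPy's `np.split`, which can be used to sketch the behavior:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
# np.split raises if the axis size is not evenly divisible, mirroring
# the "chunks must statically evenly divide x.shape[axis]" constraint.
parts = np.split(x, 2, axis=0)
```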
### `concat()` {#max.graph.ops.concat}
> max.graph.ops.concat(original\_vals, axis=0)
Concatenates a list of symbolic tensors along an axis.
Joins multiple tensors along a specified dimension. This operation requires
the functional API since it operates on multiple tensors. All input tensors
must have the same rank and the same size in all dimensions except the
concatenation axis.
```python
import max.functional as F
from max.tensor import Tensor
# Create two 2x2 matrices
a = Tensor.constant([[1, 2], [3, 4]])
b = Tensor.constant([[5, 6], [7, 8]])
# Concatenate along axis 0 (rows) - stacks vertically
vertical = F.concat([a, b], axis=0)
print(f"Concatenated along axis 0: {vertical.shape}")
# Output: Concatenated along axis 0: [Dim(4), Dim(2)]
print(vertical)
# [[1, 2],
#  [3, 4],
#  [5, 6],
#  [7, 8]]
# Concatenate along axis 1 (columns) - joins horizontally
horizontal = F.concat([a, b], axis=1)
print(f"Concatenated along axis 1: {horizontal.shape}")
# Output: Concatenated along axis 1: [Dim(2), Dim(4)]
print(horizontal)
# [[1, 2, 5, 6],
#  [3, 4, 7, 8]]
```
**Parameters:**
* original\_vals ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)]) – The list of symbolic tensor values to concatenate. Each tensor must have the same
dtype and rank, and must have the same dimension size for each
dimension other than `axis`.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis to concatenate along. If negative, indexes relative
to the end of the tensor shape. For instance, `concat(vs, -1)`
will concatenate along the last dimension.
**Returns:**
A new symbolic tensor representing the concatenation result. It will
have the same rank as each input tensor, and its dimensions will be the same
as each input tensor’s for each dimension other than axis, which will
have size equal to the sum of all tensor’s size for that dimension.
### `cond()` {#max.graph.ops.cond}
> max.graph.ops.cond(pred, out\_types, then\_fn, else\_fn)
Conditionally execute one of two branches based on a boolean predicate.
Both branches must return the same number and types of values as specified
in `out_types`. Buffer mutations in branches are tracked automatically
through the chain mechanism.
Examples:
1. Basic conditional with return values:
> ```python
> def then_fn():
> return ops.constant(1, DType.int32, device=DeviceRef.CPU())
> def else_fn():
> return ops.constant(0, DType.int32, device=DeviceRef.CPU())
>
> result = ops.cond(
> pred,
> [TensorType(DType.int32, [], device=device)],
> then_fn,
> else_fn
> )
> ```
2. Conditional with buffer mutations:
> ```python
> def then_fn():
> ops.inplace_custom("increment", device=buffer.device, values=[buffer])
> def else_fn():
> ops.inplace_custom("decrement", device=buffer.device, values=[buffer])
>
> ops.cond(pred, None, then_fn, else_fn)
> ```
**Parameters:**
* pred – Boolean scalar tensor of type `DType.bool` determining branch execution.
* out\_types – Expected output types for both branches. Use [`None`](https://docs.python.org/3/library/constants.html#None) for branches that don’t return values.
* then\_fn – Callable executed when `pred` is True. Must return values matching `out_types` if `out_types` is not [`None`](https://docs.python.org/3/library/constants.html#None).
* else\_fn – Callable executed when `pred` is False. Must return values matching `out_types` if `out_types` is not [`None`](https://docs.python.org/3/library/constants.html#None).
**Returns:**
List of output values from executed branch. Returns empty list when `out_types`
is [`None`](https://docs.python.org/3/library/constants.html#None)
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If branches return different numbers of results or result types
don’t match `out_types`
:::note Note
Buffer operations in branches automatically update the global chain state to
maintain mutation ordering constraints
:::
### `constant()` {#max.graph.ops.constant}
> max.graph.ops.constant(value, dtype=None, device=None)
Adds a node representing a constant operation.
The value of this constant will have the type TensorType with the
same shape as value. If value is a scalar type, it will create a TensorType with 0 dimensions.
The constant will be loaded with the specified dtype.
If the constant does not fit within the specified dtype, an error is raised.
Warning: Loading the constant could result in precision loss.
For example, loading 16777217 as a float32 will result in 16777216.0.
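The precision-loss example in the warning can be checked directly: 2\*\*24 + 1 is the smallest positive integer that float32 cannot represent exactly.

```python
import numpy as np

# float32 has a 24-bit significand, so 16777217 = 2**24 + 1 rounds
# to the nearest representable value, 16777216.0.
x = np.float32(16777217)
```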
**Parameters:**
* value ([DLPackArray](../driver.md#max.driver.DLPackArray) | [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[float](https://docs.python.org/3/library/functions.html#float) | [number](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.number)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[Number | NestedArray]] | [float](https://docs.python.org/3/library/functions.html#float) | [number](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.number)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – The constant’s value.
* dtype ([DType](../dtype.md#max.dtype.DType) | None) – The constant tensor’s element type.
* device ([Device](../driver.md#max.driver.Device) | [DeviceRef](type.md#max.graph.type.DeviceRef) | None) – The device the constant lives on.
**Returns:**
A graph value containing the constant data as an attribute.
### `constant_external()` {#max.graph.ops.constant_external}
> max.graph.ops.constant\_external(name, type)
Registers an external constant (weight) in the graph of a given type.
Two external constants with the same name and type refer to the same weight.
Two external constants with the same name and different types are
incompatible and will fail compilation.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The name of the external constant.
This should be the fully-qualified weight name and must be unique.
* type ([TensorType](type.md#max.graph.type.TensorType)) – The type of the constant value.
**Returns:**
A tensor value of the specified type, representing the weight value
associated with the name at compile time.
### `conv2d()` {#max.graph.ops.conv2d}
> max.graph.ops.conv2d(x, filter, stride=(1, 1), dilation=(1, 1), padding=(0, 0, 0, 0), groups=1, bias=None, input\_layout=ConvInputLayout.NHWC, filter\_layout=FilterLayout.RSCF)
Computes the 2-D convolution product of the input with the given filter, bias,
strides, dilations, paddings, and groups.
The op supports 2-D convolution, with the following layout assumptions:
* input x has NHWC layout, i.e.,
(batch\_size, height, width, in\_channels)
* filter has layout RSCF, i.e.,
(height, width, in\_channels / num\_groups, out\_channels)
* bias has shape (out\_channels,)
The padding values are expected to take the form (pad\_dim1\_before,
pad\_dim1\_after, pad\_dim2\_before, pad\_dim2\_after…) and represent padding
0’s before and after the indicated spatial dimensions in input. In 2-D
convolution, dim1 here represents H and dim2 represents W. In Python like
syntax, padding a 2x3 spatial input with \[0, 1, 2, 1] would yield:
```python
input = [
[1, 2, 3],
[4, 5, 6]
]
# Shape is 2x3
padded_input = [
[0, 0, 1, 2, 3, 0],
[0, 0, 4, 5, 6, 0],
[0, 0, 0, 0, 0, 0]
]
# Shape is 3x6
```
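The `padded_input` above can be reproduced with NumPy's `np.pad`, grouping the (before, after) amounts per spatial dimension:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])
# padding (0, 1, 2, 1): 0 rows before / 1 row after (H),
# then 2 columns before / 1 column after (W).
padded = np.pad(x, ((0, 1), (2, 1)))
```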
This op currently only supports strides and padding on the input.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – An NHWC input tensor to perform the convolution upon.
* filter (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The convolution filter in RSCF layout:
(height, width, in\_channels / num\_groups, out\_channels).
* stride ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The stride of the convolution operation.
* dilation ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The spacing between the kernel points.
* padding ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The amount of padding applied to the input.
* groups ([int](https://docs.python.org/3/library/functions.html#int)) – When greater than 1, divides the convolution into multiple
parallel convolutions. The number of input and output
channels must both be divisible by the number of groups.
* bias (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray) | None)
* input\_layout ([ConvInputLayout](type.md#max.graph.type.ConvInputLayout))
* filter\_layout ([FilterLayout](type.md#max.graph.type.FilterLayout))
**Returns:**
A symbolic tensor value with the convolution applied.
### `conv2d_transpose()` {#max.graph.ops.conv2d_transpose}
> max.graph.ops.conv2d\_transpose(x, filter, stride=(1, 1), dilation=(1, 1), padding=(0, 0, 0, 0), output\_paddings=(0, 0), bias=None, input\_layout=ConvInputLayout.NHWC, filter\_layout=FilterLayout.RSCF)
Computes the 2-D deconvolution of the input with the given filter,
strides, dilations, paddings, and groups.
The op supports the transpose (gradient) of convolution, with the following layout assumptions
(note that out\_channels is defined with respect to the original convolution):
* input x has NHWC layout, i.e.,
(batch\_size, height, width, in\_channels)
* filter has layout RSCF, i.e.,
(kernel\_height, kernel\_width, out\_channels, in\_channels)
* bias has shape (out\_channels,)
This op effectively computes the gradient of a convolution with
respect to its input (as if the original convolution operation had the same
filter and hyperparameters as this op).
The padding values are expected to take the form (pad\_dim1\_before,
pad\_dim1\_after, pad\_dim2\_before, pad\_dim2\_after…) and represent padding
0’s before and after the indicated spatial dimensions in input. In 2D
ConvTranspose, dim1 here represents H\_out and dim2 represents W\_out. In
Python-like syntax, padding a 2x4 spatial output with \[0, 1, 2, 1] would
yield:
```python
output = [
    [1, 2, 3, 4],
    [5, 6, 7, 8]
]
# Shape is 2x4
padded_output = [
    [3],
]
# Shape is 1x1
```
**Parameters:**
* x – An NHWC input tensor to perform the convolution upon.
* filter (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The convolution filter in RSCF layout:
(height, width, out\_channels, in\_channels).
* stride ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The stride of the sliding window for each spatial dimension of the input.
If a single value is given, it is replicated for both the H and W dimensions.
The N and C dimensions are never strided.
* dilation ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The spacing between the kernel points.
* padding ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The amount of padding applied to the input.
* output\_paddings ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – Resolves the ambiguity between the multiple possible
output shapes when any stride is greater than 1, by appending
output\_paddings\[i] zeros at the end of the output’s i-th spatial
axis. Only output\_paddings = (0, 0) is currently supported.
* bias (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray) | None) – tensor of shape (out\_channels,)
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray))
* input\_layout ([ConvInputLayout](type.md#max.graph.type.ConvInputLayout))
* filter\_layout ([FilterLayout](type.md#max.graph.type.FilterLayout))
**Returns:**
A symbolic tensor value with the convolution applied.
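As a quick sanity check on shapes, the standard transposed-convolution output-size formula (the same one used by frameworks such as PyTorch) can be sketched per spatial dimension; `conv_transpose_out_dim` below is a hypothetical helper, not part of `max.graph`:

```python
# Hypothetical helper illustrating the standard transposed-convolution
# output-size formula per spatial dimension; not part of max.graph.
def conv_transpose_out_dim(in_dim, kernel, stride=1, dilation=1,
                           pad_before=0, pad_after=0, output_padding=0):
    # Each input element is spaced out by `stride`, the (dilated) kernel
    # extends the result, and padding crops it back down.
    return ((in_dim - 1) * stride - pad_before - pad_after
            + dilation * (kernel - 1) + output_padding + 1)

# A 4-wide input upsampled with a 3-wide kernel and stride 2 gives width 9.
print(conv_transpose_out_dim(4, kernel=3, stride=2))  # 9
```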
### `conv3d()` {#max.graph.ops.conv3d}
> max.graph.ops.conv3d(x, filter, stride=(1, 1, 1), dilation=(1, 1, 1), padding=(0, 0, 0, 0, 0, 0), groups=1, bias=None, input\_layout=ConvInputLayout.NHWC, filter\_layout=FilterLayout.QRSCF)
Computes the 3-D convolution product of the input with the given filter,
strides, dilations, paddings, and groups.
The op supports 3-D convolution, with the following layout assumptions:
* input has NDHWC layout, i.e.,
(batch\_size, depth, height, width, in\_channels)
* filter has layout QRSCF, i.e.,
(depth, height, width, in\_channels / num\_groups, out\_channels)
The padding values are expected to take the form (pad\_dim1\_before,
pad\_dim1\_after, pad\_dim2\_before, pad\_dim2\_after…) and represent padding
0’s before and after the indicated spatial dimensions in input. In 3-D
convolution, dim1 here represents D, dim2 represents H, and dim3 represents W. In Python-like
syntax, padding a 2x3 spatial input with \[0, 1, 2, 1] (showing only two spatial dimensions for brevity) would yield:
```python
input = [
    [1, 2, 3],
    [4, 5, 6]
]
# Shape is 2x3
padded_input = [
    [0, 0, 1, 2, 3, 0],
    [0, 0, 4, 5, 6, 0],
    [0, 0, 0, 0, 0, 0]
]
# Shape is 3x6
```
This op currently only supports strides and padding on the input.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – An NDHWC input tensor to perform the convolution upon.
* filter (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The convolution filter in QRSCF layout:
(depth, height, width, in\_channels / num\_groups, out\_channels).
* stride ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The stride of the convolution operation.
* dilation ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The spacing between the kernel points.
* padding ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The amount of padding applied to the input.
* groups ([int](https://docs.python.org/3/library/functions.html#int)) – When greater than 1, divides the convolution into multiple
parallel convolutions. The number of input and output
channels must both be divisible by the number of groups.
* bias (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray) | None)
* input\_layout ([ConvInputLayout](type.md#max.graph.type.ConvInputLayout))
* filter\_layout ([FilterLayout](type.md#max.graph.type.FilterLayout))
**Returns:**
A symbolic tensor value with the convolution applied.
Output shape = (batch\_size, depth, height, width, out\_channels).
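The spatial output dimensions follow the standard convolution output-size formula, applied independently per dimension; `conv_out_dim` below is a hypothetical helper, not part of `max.graph`:

```python
# Hypothetical helper for the standard convolution output-size formula
# (per spatial dimension); not part of max.graph.
def conv_out_dim(in_dim, kernel, stride=1, dilation=1, pad_before=0, pad_after=0):
    # Dilation spreads the kernel taps apart before the window slides.
    effective_kernel = dilation * (kernel - 1) + 1
    return (in_dim + pad_before + pad_after - effective_kernel) // stride + 1

# Depth, height, and width shrink independently: a 5x5x5 volume with a
# 3x3x3 kernel and no padding yields a 3x3x3 output.
print([conv_out_dim(5, 3) for _ in range(3)])  # [3, 3, 3]
```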
### `cumsum()` {#max.graph.ops.cumsum}
> max.graph.ops.cumsum(x, axis=-1, exclusive=False, reverse=False)
Computes the cumulative sum of the input tensor along the given axis.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor to sum over.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the sum. If negative,
indexes from the last dimension. For example, a value of -1 will
compute the sum along the last dimension.
* exclusive ([bool](https://docs.python.org/3/library/functions.html#bool)) – If set, start at 0 and exclude the final element.
Otherwise, start with the first element. Said another way, cumsum
computes \[sum(x\[…, :i + 1, …]) for i in range(x.shape\[axis])].
If exclusive is set, it instead computes \[sum(x\[…, :i, …]) for i in range(x.shape\[axis])].
* reverse ([bool](https://docs.python.org/3/library/functions.html#bool)) – If set, start from the end. In other words, the first element
will be the total sum, with each element following counting
downwards; or \[sum(x\[…, i:, …]) for i in range(x.shape\[axis])].
**Returns:**
A symbolic tensor representing the result of the cumsum operation.
The tensor will have the same type as the input tensor. The computed
values will be the cumulative sum of the values along the given axis,
according to the specified parameters:
* if exclusive is set, the first value will be 0, and the last
value will be excluded from the sum
* if reverse is set, the sum will be computed starting at the
back of the axis back to the front, rather than front-to-back
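The combined semantics of `exclusive` and `reverse` can be sketched in plain Python over a 1-D list; `cumsum_ref` is a hypothetical reference helper, not the MAX op:

```python
# Reference semantics of cumsum on a 1-D list (hypothetical helper,
# not the MAX op), mirroring the parameter definitions above.
def cumsum_ref(x, exclusive=False, reverse=False):
    if reverse:
        x = x[::-1]
    out, total = [], 0
    for v in x:
        if exclusive:
            out.append(total)  # running sum BEFORE adding this element
            total += v
        else:
            total += v         # running sum INCLUDING this element
            out.append(total)
    if reverse:
        out = out[::-1]
    return out

print(cumsum_ref([1, 2, 3]))                  # [1, 3, 6]
print(cumsum_ref([1, 2, 3], exclusive=True))  # [0, 1, 3]
print(cumsum_ref([1, 2, 3], reverse=True))    # [6, 5, 3]
```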
### `custom()` {#max.graph.ops.custom}
> max.graph.ops.custom(name, device, values, out\_types, parameters=None)
Creates a node to execute a custom graph operation in the graph.
The custom op should be registered by annotating a function with the
[@compiler.register](/mojo/manual/decorators/compiler-register/)
decorator.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The op name provided to `@compiler.register`.
* values ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The op function’s arguments.
* out\_types ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Type](type.md#max.graph.type.Type)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The list of the op function’s return types.
* parameters ([Mapping](https://docs.python.org/3/library/collections.abc.html#collections.abc.Mapping)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [bool](https://docs.python.org/3/library/functions.html#bool) | [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [DType](../dtype.md#max.dtype.DType)] | None) – Dictionary of extra parameters expected by the kernel.
* device ([Device](../driver.md#max.driver.Device) | [DeviceRef](type.md#max.graph.type.DeviceRef)) – Device that the op is assigned to.
This becomes a target parameter to the kernel.
**Returns:**
Symbolic values representing the outputs of the op in the graph.
These correspond 1:1 with the types passed as `out_types`.
### `dequantize()` {#max.graph.ops.dequantize}
> max.graph.ops.dequantize(encoding, quantized)
Dequantizes a quantized tensor to floating point.
NOTE: Currently this supports Q4\_0, Q4\_K, and Q6\_K encodings only.
**Parameters:**
* encoding ([QuantizationEncoding](quantization.md#max.graph.quantization.QuantizationEncoding)) – The quantization encoding to use.
* quantized ([TensorValue](TensorValue.md#max.graph.TensorValue)) – The quantized tensor to dequantize.
### `distributed_broadcast()` {#max.graph.ops.distributed_broadcast}
> max.graph.ops.distributed\_broadcast(input, signal\_buffers)
Broadcasts a tensor from the source GPU to all GPUs.
This op is a collective operation which broadcasts a tensor from the source
GPU (where the input tensor resides) to all participating GPUs. Each GPU
receives a copy of the input tensor.
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – Input tensor to broadcast. The device where this tensor resides
becomes the root/source of the broadcast.
* signal\_buffers ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[BufferValue](BufferValue.md#max.graph.BufferValue) | HasBufferValue]) – Device buffer values used for synchronization.
The number of signal buffers determines the number of participating
GPUs.
**Returns:**
List of output tensors, one per device. Each output tensor has the
same shape and dtype as the input tensor.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If input tensor device is not found in signal buffer devices,
if devices are not unique, or if there are fewer than 2 signal buffers.
### `div()` {#max.graph.ops.div}
> max.graph.ops.div(lhs, rhs)
Divides two symbolic tensors using true division (Python operator /).
For integer operands, this performs true division by promoting to float,
matching Python’s / operator behavior. For floating-point operands,
this performs standard floating-point division.
Creates a new op node to compute the division of two symbol tensor values
and adds it to the graph, returning the symbolic result.
**Parameters:**
* lhs (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The symbol to use as left side of the division.
* rhs (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The symbol to use as right side of the division.
**Returns:**
A symbolic tensor value representing the output of the division. The
result will have:
* a floating-point dtype for integer operands, or the promoted dtype for mixed types
* the same shape as the broadcast of the two input shapes.
**Raises:**
* Error – If the input values’ shapes are not compatible for broadcasting.
* Error – If one of the input values has an unsupported dtype.
* Error – If the two symbols are parts of different graphs.
### `exp()` {#max.graph.ops.exp}
> max.graph.ops.exp(x)
Computes the elementwise exp (exponential) function of a symbolic tensor.
Creates a new op node to compute the elementwise exponential function of a
symbolic tensor and adds it to the graph, returning the symbolic result.
The exp function is fundamental in neural networks, used in attention
mechanisms, activation functions, and probability distributions.
```python
import max.functional as F
from max.tensor import Tensor

# Create input tensor
x = Tensor.constant([0.0, 1.0, 2.0])

# Compute exponential
result = F.exp(x)
print(result)
# Output: [1.0, 2.718..., 7.389...]
# (e^0 = 1, e^1 ≈ 2.718, e^2 ≈ 7.389)
```
`exp` is defined as `exp(x) = e^x`, where `e` is Euler’s number.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The symbolic tensor to use as the input to the exp function
computation.
**Returns:**
A new symbolic tensor value representing the output of the exp
value computation.
**Raises:**
Error – If the symbol doesn’t represent a tensor value.
### `flatten()` {#max.graph.ops.flatten}
> max.graph.ops.flatten(x, start\_dim=0, end\_dim=-1)
Flattens the specified dims of a symbolic tensor.
The number and order of the elements in the tensor are unchanged.
All dimensions from start\_dim to end\_dim (inclusive) are merged into a single output dim.
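The resulting shape can be sketched with a small hypothetical helper (`flatten_shape` below is for illustration only, not part of `max.graph`):

```python
# Hypothetical helper computing the shape produced by flatten(x,
# start_dim, end_dim); not part of max.graph.
def flatten_shape(shape, start_dim=0, end_dim=-1):
    n = len(shape)
    start = start_dim % n   # normalize negative dims
    end = end_dim % n
    merged = 1
    for d in shape[start:end + 1]:
        merged *= d         # dims start..end (inclusive) collapse into one
    return shape[:start] + (merged,) + shape[end + 1:]

print(flatten_shape((2, 3, 4)))          # (24,)  -- flatten everything
print(flatten_shape((2, 3, 4), 1, -1))   # (2, 12) -- keep the batch dim
```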
### `fold()` {#max.graph.ops.fold}
> max.graph.ops.fold(input, output\_size, kernel\_size, stride=1, dilation=1, padding=0)
Combines an array of sliding blocks into a larger containing tensor.
The input tensor must have shape `(N, C * kernel_sizes, L)` where `N` is
the batch dimension, `C` is the number of channels, `kernel_sizes` is
the product of the kernel sizes, and `L` is the number of local blocks.
The resulting output tensor will have shape
`(N, C, output_shape[0], output_shape[1])`.
`L`, the number of blocks, must be equivalent to:
`prod((output_size[d] + 2 * padding[d] - dilation[d] * (kernel_size[d] - 1) - 1) / stride[d] + 1)`
where `d` is over all spatial dimensions.
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The 3D tensor to fold with shape `(N, C * kernel sizes, L)`.
* output\_size ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)], [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – Spatial dimensions of the output tensor. Must be a tuple of two ints.
* kernel\_size ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)], [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The size of the sliding blocks. Must be a tuple of two ints.
* stride ([int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The stride of the sliding blocks in the input dimension
(can be an int or a tuple of two ints).
* dilation ([int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The spacing between the kernel elements.
(can be an int or a tuple of two ints).
* padding ([int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – 0-paddings to be added on both sides of the inputs.
(can be an int or a tuple of two ints).
**Returns:**
The folded 4D tensor with shape `(N, C, output_shape[0], output_shape[1])`.
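The constraint on `L` above can be checked with a small hypothetical helper (`num_blocks` is for illustration only, not part of `max.graph`):

```python
# Hypothetical helper computing L, the required number of local blocks,
# from the formula above; not part of max.graph.
def num_blocks(output_size, kernel_size, stride, dilation, padding):
    L = 1
    for d in range(len(output_size)):
        L *= ((output_size[d] + 2 * padding[d]
               - dilation[d] * (kernel_size[d] - 1) - 1) // stride[d] + 1)
    return L

# A 4x5 output with a 2x2 kernel, unit stride/dilation, no padding
# needs 3 * 4 = 12 blocks.
print(num_blocks((4, 5), (2, 2), (1, 1), (1, 1), (0, 0)))  # 12
```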
### `gather()` {#max.graph.ops.gather}
> max.graph.ops.gather(input, indices, axis)
Selects elements out of an input tensor by index.
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to select elements from.
* indices (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – A symbolic tensor of index values to use for selection.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The dimension which `indices` indexes from `input`. If negative,
indexes relative to the end of the input tensor. For instance,
`gather(input, indices, axis=-1)` will index against the last
dimension of `input`.
**Returns:**
A new symbolic tensor representing the result of the gather
operation.
### `gather_nd()` {#max.graph.ops.gather_nd}
> max.graph.ops.gather\_nd(input, indices, batch\_dims=0)
Selects elements out of an input tensor by N-dimensional index.
This operation performs N-dimensional indexing into `input` using `indices`.
Unlike [`gather()`](#max.graph.ops.gather), which indexes along a single axis, `gather_nd()` allows
indexing along multiple dimensions simultaneously.
```python
from max.dtype import DType
from max.graph import Graph, TensorType, ops

input_shape = ["a", "b", "c", "d", "e"]
indices_shape = ["a", "f", 3]
input_type = TensorType(DType.bfloat16, input_shape)
indices_type = TensorType(DType.int32, indices_shape)

with Graph("gather_nd", input_types=[input_type, indices_type]) as graph:
    input, indices = graph.inputs
    gathered = ops.gather_nd(input, indices, batch_dims=1)
    print(gathered.type)
    # Output: TensorType(dtype=DType.bfloat16, shape=["a", "f", "e"])
```
In this example:
* `batch_dims` is 1, so there’s 1 shared dimension at the beginning.
* `indices` has an additional dimension “f” which becomes part of the output.
* The last dimension of `indices` is the index vector; values in this vector
are interpreted to be indices into “b”, “c”, and “d”.
* Since `batch_dims (1) + index size (3) < input.rank (5)`, the remaining
dimensions (in this case “e”) are sliced into the output as features.
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to select elements from.
* indices (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – A symbolic tensor of index values to use for selection.
The last dimension of this tensor must be static. This dimension
will be used to index or slice into `input` immediately following
`batch_dims` initial dimensions. The size of this index dimension
is the number of dimensions it specifies.
* batch\_dims ([int](https://docs.python.org/3/library/functions.html#int)) – The number of leading batch dimensions shared by
`input` and `indices`; 0 by default. `input` and `indices` must
exactly match up to their first `batch_dims` dimensions. This
function does not broadcast.
**Returns:**
A new symbolic tensor representing the result of the gather operation.
The output will have the same dtype as `input`, and will have shape
depending on the inputs, in this order:
* `input.shape[:batch_dims]` – The “broadcast” dimensions (though note
that this function does not broadcast). These dimensions must be
identical between `input` and `indices`.
* `indices.shape[batch_dims:-1]` – The “gather” dimensions; this allows
multi-dimensional tensors of indices. The last dimension is the index vector.
* `input.shape[batch_dims + indices.shape[-1]:]` – The “slice” dimensions.
If `batch_dims` < `input.rank - indices.shape[-1]` (again, this last
is the index vector), then any following dimensions of the inputs are
taken entirely as though slicing.
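The output-shape rule above can be sketched for the `batch_dims=0` case in plain NumPy; `gather_nd_ref` is a hypothetical reference helper, not the MAX op:

```python
import numpy as np

# Reference semantics of gather_nd for batch_dims=0 (hypothetical
# sketch, not the MAX op). The last dimension of `indices` is the index
# vector into the leading dimensions of `input`; any remaining input
# dimensions are sliced into the output.
def gather_nd_ref(input, indices):
    out_shape = indices.shape[:-1] + input.shape[indices.shape[-1]:]
    flat = indices.reshape(-1, indices.shape[-1])
    gathered = np.stack([input[tuple(ix)] for ix in flat])
    return gathered.reshape(out_shape)

x = np.arange(24).reshape(2, 3, 4)
idx = np.array([[0, 1], [1, 2]])   # two index vectors into dims (0, 1)
out = gather_nd_ref(x, idx)
print(out.shape)  # (2, 4): each index vector selects a length-4 slice
```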
### `gelu()` {#max.graph.ops.gelu}
> max.graph.ops.gelu(x, approximate='none')
Computes the elementwise gelu of a symbolic tensor.
Creates a new op node to compute the elementwise gelu of a
symbolic tensor and adds it to the graph, returning the symbolic result.
For `approximate == "none"`, the exact gelu function
$$
\text{gelu}(x) = x \cdot \Phi(x) = 0.5 x \left(1 + \operatorname{erf}\left(x / \sqrt{2}\right)\right)
$$
is computed, where $\Phi$ is the CDF of the standard normal distribution.
For `approximate == "tanh"`, the approximation:
$$
\text{gelu}(x) = 0.5 x \left(1 + \tanh\left(0.7978845608028654 \left(x + 0.044715 x^3\right)\right)\right)
$$
is used.
For `approximate == "quick"`, the approximation:
$$
\text{gelu}(x) = \sigma(1.702 x) \cdot x
$$
is used.
**Parameters:**
* x ([TensorValue](TensorValue.md#max.graph.TensorValue)) – The symbolic tensor to use as the input to the gelu
computation.
* approximate ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The approximation method: `"none"`, `"tanh"`, or `"quick"`.
**Returns:**
A new symbolic tensor value representing the output of the gelu
computation.
**Raises:**
* Error – If the symbol doesn’t represent a tensor value.
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the approximation method is invalid.
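The three variants above can be compared with a plain-NumPy reference (an illustrative sketch, not the MAX kernels):

```python
import math
import numpy as np

# Plain-NumPy reference for the three gelu variants documented above.
def gelu_exact(x):
    erf = np.array([math.erf(v / math.sqrt(2.0)) for v in x])
    return 0.5 * x * (1.0 + erf)

def gelu_tanh(x):
    return 0.5 * x * (1.0 + np.tanh(0.7978845608028654 * (x + 0.044715 * x**3)))

def gelu_quick(x):
    return x / (1.0 + np.exp(-1.702 * x))  # sigmoid(1.702 * x) * x

x = np.linspace(-3.0, 3.0, 13)
# The tanh form tracks the exact gelu closely over this range.
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))
```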
### `hann_window()` {#max.graph.ops.hann_window}
> max.graph.ops.hann\_window(window\_length, device, periodic=True, dtype=float32)
Calculate a Hann window for a given length.
Hann window function:
$$
H[n] = \frac{1}{2}\left[1 - \cos\left(\frac{2\pi n}{N - 1}\right)\right]
$$
where $N$ is `window_length`.
**Parameters:**
* window\_length ([int](https://docs.python.org/3/library/functions.html#int)) – The length of the window.
* device ([DeviceRef](type.md#max.graph.type.DeviceRef)) – The device to run the operation on.
* periodic ([bool](https://docs.python.org/3/library/functions.html#bool)) – If true, trims off the last
duplicate value from the symmetric window, making the result ready to be
used as a periodic window with functions like stft(). In other words,
`hann_window(L, periodic=True) == hann_window(L + 1, periodic=False)[:-1]`.
* dtype ([DType](../dtype.md#max.dtype.DType)) – The desired data type of the output tensor.
**Returns:**
A 1-D tensor of size (window\_length,) containing the window.
**Raises:**
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If window\_length is negative.
* [TypeError](https://docs.python.org/3/library/exceptions.html#TypeError) – If window\_length is not an integer.
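A NumPy sketch of the formula and the periodic/symmetric relationship noted above (an illustration of the semantics, not the MAX implementation):

```python
import numpy as np

# NumPy sketch of the Hann window and the periodic/symmetric identity above.
def hann(window_length, periodic=True):
    n = np.arange(window_length)
    # The periodic window is the symmetric window of length + 1
    # with the final duplicate sample dropped.
    N = window_length + 1 if periodic else window_length
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * n / (N - 1)))

L = 8
w_per = hann(L, periodic=True)
w_sym = hann(L + 1, periodic=False)
print(np.allclose(w_per, w_sym[:-1]))  # True
```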
### `inplace_custom()` {#max.graph.ops.inplace_custom}
> max.graph.ops.inplace\_custom(name, device, values, out\_types=None, parameters=None)
Creates a node to execute an in-place custom graph operation in the graph.
The custom op should be registered by annotating a function with the
[@compiler.register](/mojo/manual/decorators/compiler-register/)
decorator.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The op name provided to `@compiler.register`.
* device ([Device](../driver.md#max.driver.Device) | [DeviceRef](type.md#max.graph.type.DeviceRef)) – Device that the op is assigned to.
This becomes a target parameter to the kernel.
* values ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The op function’s arguments.
* parameters ([dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [bool](https://docs.python.org/3/library/functions.html#bool) | [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [DType](../dtype.md#max.dtype.DType)] | None) – Dictionary of extra parameters expected by the kernel.
* out\_types ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Type](type.md#max.graph.type.Type)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]] | None)
### `irfft()` {#max.graph.ops.irfft}
> max.graph.ops.irfft(input\_tensor, n=None, axis=-1, normalization=Normalization.BACKWARD, input\_is\_complex=False, buffer\_size\_mb=512)
Compute the inverse real FFT of the input tensor.
**Parameters:**
* input\_tensor (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue) – The input tensor to compute the inverse real FFT of.
* n ([int](https://docs.python.org/3/library/functions.html#int) | None) – The size of the output tensor. Must be an int, and cannot be a
symbolic Buffer. The input tensor will be padded or truncated to
n // 2 + 1 along the specified axis.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis to compute the inverse real FFT of.
* normalization (Normalization | [str](https://docs.python.org/3/library/stdtypes.html#str)) – The normalization to apply to the output tensor.
Can be “backward”, “ortho”, or “forward”. When “backward”, the
output is divided by n. When “ortho”, the output is divided by
sqrt(n). When “forward”, no normalization is applied.
* input\_is\_complex ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether the input tensor is already interleaved
complex. The last dimension of the input tensor must be 2, and is
excluded from the dimension referred to by axis.
* buffer\_size\_mb ([int](https://docs.python.org/3/library/functions.html#int)) – The estimated size of a persistent buffer to use for
storage of intermediate results. Needs to be the same across multiple
calls to irfft within the same graph. Otherwise, multiple buffers
will be allocated.
**Returns:**
The inverse real FFT of the input tensor. The shape of the output tensor
is the same as the shape of the input tensor, except for the axis that
the inverse real FFT is computed over, which is replaced by n.
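The `n // 2 + 1` spectrum length and the axis replacement can be seen with NumPy's equivalent transform (an analogue of the semantics, not the MAX op):

```python
import numpy as np

# NumPy analogue of the inverse real FFT described above: the input spectrum
# has length n // 2 + 1, and the transformed axis is replaced by n.
signal = np.array([1.0, 2.0, 3.0, 4.0])
spectrum = np.fft.rfft(signal)           # length 4 // 2 + 1 == 3
recovered = np.fft.irfft(spectrum, n=4)  # "backward" normalization: divide by n
print(recovered)  # [1. 2. 3. 4.]
```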
### `layer_norm()` {#max.graph.ops.layer_norm}
> max.graph.ops.layer\_norm(input, gamma, beta, epsilon)
Performs layer normalization over the last dimension of the input tensor.
**Parameters:**
* input ([TensorValue](TensorValue.md#max.graph.TensorValue)) – The input tensor to normalize.
* gamma (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The gamma parameter of the normalization.
* beta (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The beta parameter of the normalization.
* epsilon ([float](https://docs.python.org/3/library/functions.html#float)) – The epsilon parameter of the normalization.
**Returns:**
A graph tensor value with the normalization applied.
**Raises:**
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If gamma size doesn’t match the last dimension of input.
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If beta size doesn’t match the last dimension of input.
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If epsilon is not positive.
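A NumPy reference for the normalization described by the gamma, beta, and epsilon parameters above (an illustrative sketch, not the MAX kernel):

```python
import numpy as np

# NumPy reference for the normalization above: normalize over the last
# dimension, then scale by gamma and shift by beta.
def layer_norm_ref(x, gamma, beta, epsilon=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon) * gamma + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = layer_norm_ref(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean())  # ~0: each row is normalized to zero mean, unit variance
```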
### `log()` {#max.graph.ops.log}
> max.graph.ops.log(x)
Computes the elementwise natural logarithm of a symbolic tensor.
Creates a new op node to compute the elementwise natural logarithm of a
symbolic tensor and adds it to the graph, returning the symbolic result.
The natural logarithm is used in loss functions, normalization, and
probability calculations in machine learning.
```python
import max.functional as F
from max.tensor import Tensor

# Create input tensor (positive values only)
x = Tensor.constant([1.0, 2.718, 7.389, 20.0])

# Compute natural logarithm
result = F.log(x)
print(result)
# Output: [0.0, 1.0, 2.0, 2.996...]
# (log(1) = 0, log(e) = 1, log(e^2) = 2)
```
The natural logarithm function `log` is defined as the inverse of the
exponential function `exp()`. In other words, it computes the value `y` in
the equation `x = e^y` where `e` is Euler’s number.
`log(x)` is undefined for `x <= 0` for real numbers. Complex numbers
are currently unsupported.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The symbolic tensor to use as the input to the natural logarithm
computation.
**Returns:**
A new symbolic tensor value representing the output of the natural
logarithm computation.
**Raises:**
Error – If the symbol doesn’t represent a tensor value.
### `masked_scatter()` {#max.graph.ops.masked_scatter}
> max.graph.ops.masked\_scatter(input, mask, updates, out\_dim)
Creates a new symbolic tensor where the updates are written to input where mask is true.
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to write elements to.
* mask (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – A symbolic tensor of boolean values to update.
* updates (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – A symbolic tensor of elements to write to input.
* out\_dim ([int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – The new data-dependent dimension.
**Returns:**
A new symbolic tensor representing the result of the masked\_scatter operation.
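The write semantics above can be sketched with NumPy's boolean-mask assignment (an illustration, not the MAX API):

```python
import numpy as np

# NumPy sketch of masked_scatter semantics: `updates` values are written
# into `data` at the positions where `mask` is True, in row-major order.
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([True, False, True, False])
updates = np.array([10.0, 30.0])

result = data.copy()
result[mask] = updates
print(result)  # [10.  2. 30.  4.]
```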
### `matmul()` {#max.graph.ops.matmul}
> max.graph.ops.matmul(lhs, rhs)
Computes the matrix multiplication of two tensor graph values.
Performs general matrix multiplication with broadcasting. Matrix multiplication
is fundamental to neural networks, used for linear transformations, attention
mechanisms, and fully connected layers.
```python
from max.tensor import Tensor

# Create two 2x2 matrices
x = Tensor.constant([[1.0, 2.0], [3.0, 4.0]])  # Shape: (2, 2)
w = Tensor.constant([[5.0, 6.0], [7.0, 8.0]])  # Shape: (2, 2)

# Matrix multiply using @ operator (uses matmul internally)
result = x @ w
print("Matrix multiplication result:")
print(result)
# Output: [[19.0, 22.0],
#          [43.0, 50.0]]
# Computed as: result[i,j] = sum(x[i,k] * w[k,j])

# Can also call directly via functional API
import max.functional as F
result2 = F.matmul(x, w)
# Same result as x @ w
```
If the lhs is 1D, it will be reshaped to `1xD`.
If the rhs is 1D, it will be reshaped to `Dx1`.
In both cases, the additional 1 dimensions will be removed from the
output shape.
For the multiplication, the innermost (rightmost) 2 dimensions are treated
as a matrix.
The lhs matrix will have the shape `MxK`.
The rhs matrix will have the shape `KxN`.
The output will have the shape `MxN`.
The `K` dimensions must be equivalent in both matrices.
The remaining outer dimensions will be broadcast.
**Parameters:**
* lhs (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The left-hand side input tensor.
* rhs (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The right-hand side input tensor.
**Returns:**
A tensor graph value representing the matrix product of `lhs` and `rhs`.
For 2D inputs, the output shape is `(M, N)` where `lhs` is `(M, K)`
and `rhs` is `(K, N)`. For higher-dimensional inputs, batch
dimensions are preserved and the operation is applied to the last two
dimensions of each input.
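The 1-D promotion and broadcasting rules above mirror NumPy's matmul semantics, which can serve as a quick sanity check:

```python
import numpy as np

# The 1-D promotion rules above mirror NumPy's matmul semantics.
v = np.array([1.0, 2.0])                 # rank-1, D == 2
m = np.array([[1.0, 2.0], [3.0, 4.0]])
print((v @ m).shape)   # (2,)  lhs treated as 1xD; leading 1 removed
print((m @ v).shape)   # (2,)  rhs treated as Dx1; trailing 1 removed

b = np.zeros((5, 2, 3))                  # outer batch dim broadcasts
w = np.zeros((3, 4))
print((b @ w).shape)   # (5, 2, 4)
```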
### `max()` {#max.graph.ops.max}
> max.graph.ops.max(x, y=None, /, axis=None)
Overload for ops.elementwise.max and ops.reduction.max.
* If two tensors are provided, axis is ignored and returns an elementwise maximum.
* If one tensor is provided, compute ops.reduction.max on the tensor and axis.
### `max_pool2d()` {#max.graph.ops.max_pool2d}
> max.graph.ops.max\_pool2d(input, kernel\_size, stride=1, dilation=1, padding=0, ceil\_mode=False)
Perform a 2D max pooling operation on the input tensor.
This function applies a 2D max pooling operation to the input tensor \[N, H, W, C].
The pooling operation slides a window of size kernel\_size over the input
tensor, and selects the maximum value within each window.
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor to perform the pooling operation on.
* kernel\_size ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)], [int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The size of the sliding blocks.
* stride ([int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The stride of the sliding blocks in the input dimension.
* dilation ([int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – The spacing between the kernel elements.
* padding ([int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – 0-paddings to be added on both sides of the inputs.
* ceil\_mode ([bool](https://docs.python.org/3/library/functions.html#bool)) – If true, use ceil instead of floor to compute the output shape.
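The sliding-window selection above can be sketched in plain NumPy on an `[N, H, W, C]` tensor (an illustration of the windowing, not the MAX kernel):

```python
import numpy as np

# Illustrative NHWC max pooling with a 2x2 kernel and stride 2.
x = np.arange(16, dtype=np.float32).reshape(1, 4, 4, 1)  # [N, H, W, C]
kh = kw = 2
out = np.empty((1, 2, 2, 1), dtype=np.float32)
for i in range(2):
    for j in range(2):
        # Select the maximum value within each kernel-sized window.
        window = x[:, i * 2:i * 2 + kh, j * 2:j * 2 + kw, :]
        out[:, i, j, :] = window.max(axis=(1, 2))
print(out[0, :, :, 0])
# [[ 5.  7.]
#  [13. 15.]]
```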
### `mean()` {#max.graph.ops.mean}
> max.graph.ops.mean(x, axis=-1)
Reduces a symbolic tensor using a mean operation.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor for the operation.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension. For example, a value of -1 will
compute the reduction along the last dimension.
**Returns:**
A symbolic tensor representing the result of the mean operation.
The tensor will have the same rank as the input tensor, and the same
shape except along the `axis` dimension which will have size 1.
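The rank-preserving behavior described above matches NumPy's `keepdims=True`:

```python
import numpy as np

# mean keeps the reduced axis with size 1, like NumPy's keepdims=True:
x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
result = x.mean(axis=-1, keepdims=True)
print(result.shape)  # (2, 1) — same rank, reduced axis has size 1
```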
### `min()` {#max.graph.ops.min}
> max.graph.ops.min(x, y=None, /, axis=None)
Overload for ops.elementwise.min and ops.reduction.min.
* If two tensors are provided, axis is ignored and returns an elementwise minimum.
* If one tensor is provided, compute ops.reduction.min on the tensor and axis.
### `nonzero()` {#max.graph.ops.nonzero}
> max.graph.ops.nonzero(x, out\_dim)
Returns the indices of all nonzero elements in a tensor.
Returns a tensor of indices of the nonzero values in the given tensor. The
return value is a 2D tensor of shape `[out_dim x rank_in]`, where
out\_dim is the number of nonzero elements in the input tensor, and
rank\_in is the rank of the input tensor. Indices are generated in
row-major order.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor.
* out\_dim ([int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – The newly generated dimension that is sized for the number of
nonzero elements.
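The `[out_dim x rank_in]` index layout described above matches NumPy's `argwhere` (an illustration, not the MAX API):

```python
import numpy as np

# NumPy sketch of the [out_dim x rank_in] row-major index layout above.
x = np.array([[0, 1], [2, 0]])
indices = np.argwhere(x)  # row-major indices of the nonzero elements
print(indices.shape)  # (2, 2): two nonzero elements, rank-2 input
```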
### `outer()` {#max.graph.ops.outer}
> max.graph.ops.outer(lhs, rhs)
Computes the outer product of two symbolic vectors.
**Parameters:**
* lhs (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The left side of the product. Whatever its shape,
it will be flattened to a rank-1 vector.
* rhs (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The right side of the product. Whatever its shape,
it will be flattened to a rank-1 vector. Must have the
same number of elements as lhs.
**Returns:**
A symbolic tensor representing the
[outer product](https://en.wikipedia.org/wiki/Outer_product)
of the two input vectors. It will have rank 2, with the dimension
sizes being the number of elements of lhs and rhs respectively.
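The flattening and rank-2 result described above can be sketched with NumPy's equivalent:

```python
import numpy as np

# Outer product sketch: both inputs are flattened to rank-1 vectors and
# the result has rank 2.
a = np.array([[1.0, 2.0]])  # any shape; flattened to 2 elements
b = np.array([3.0, 4.0])
result = np.outer(a, b)
print(result)
# [[3. 4.]
#  [6. 8.]]
```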
### `pad()` {#max.graph.ops.pad}
> max.graph.ops.pad(input, paddings, mode='constant', value=0)
Pads a tensor with constant values.
Adds padding to the input tensor using the specified padding values.
Currently only constant padding mode is supported.
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor to pad.
* paddings ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int)]) – Sequence of padding values. The padding values are applied
symmetrically to each dimension. For a tensor with rank N,
paddings should contain 2\*N values: [pad\_before\_dim0, pad\_after\_dim0,
pad\_before\_dim1, pad\_after\_dim1, …].
* mode ([Literal](https://docs.python.org/3/library/typing.html#typing.Literal)\['constant']) – The padding mode. Currently only “constant” is supported.
* value (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The constant value to use for padding.
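The flat `[pad_before_dim0, pad_after_dim0, …]` layout above corresponds to NumPy's pair-per-dimension form, which can illustrate the result shape:

```python
import numpy as np

# The flat paddings layout [before_dim0, after_dim0, before_dim1, after_dim1]
# corresponds to NumPy's pair-per-dimension form.
x = np.ones((2, 2))
paddings = [1, 1, 0, 2]  # dim 0: 1 before / 1 after; dim 1: 0 before / 2 after
pairs = list(zip(paddings[0::2], paddings[1::2]))
result = np.pad(x, pairs, mode="constant", constant_values=0)
print(result.shape)  # (4, 4)
```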
### `permute()` {#max.graph.ops.permute}
> max.graph.ops.permute(x, dims)
Permutes all dimensions of a symbolic tensor.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to transpose.
* dims ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]) – The desired ordering of the dimensions in the output tensor.
**Returns:**
A new symbolic tensor with the dimensions permuted to match the passed in order.
It has the same elements and dtype, but the order of the elements
is different according to the permutation.
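The dimension reordering above behaves like NumPy's `transpose` with explicit axes:

```python
import numpy as np

# permute reorders all dimensions at once, like np.transpose with axes:
x = np.zeros((2, 3, 4))
result = np.transpose(x, (2, 0, 1))  # dims[i] names the input axis placed at i
print(result.shape)  # (4, 2, 3)
```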
### `print()` {#max.graph.ops.print}
> max.graph.ops.print(value, label='debug\_tensor')
Prints the value of a tensor or a string during graph execution.
This function outputs the current value of a tensor and is
primarily used for debugging within the context of the MAX
Engine and its graph execution framework. It is particularly useful for
verifying that the intermediate results of your computations are as expected.
By printing the tensor values, you can visualize the data flowing through the
graph, which helps in understanding how the operations are transforming
the data.
By assigning a label, you can identify which tensor’s value is being
printed, especially when there are multiple print statements in a
complex graph.
```python
from typing import Any
import numpy as np
from max.dtype import DType
from max.graph import DeviceRef, Graph, TensorType, ops

def add_tensors(a: np.ndarray, b: np.ndarray) -> dict[str, Any]:
    input_type = TensorType(dtype=DType.float32, shape=(1,), device=DeviceRef.CPU())
    with Graph(
        "simple_add_graph", input_types=(input_type, input_type)
    ) as graph:
        lhs, rhs = graph.inputs
        out = ops.add(lhs, rhs)
        ops.print(out, label="addition_output")  # Pass the output tensor here
        graph.output(out)
        print("final graph:", graph)
```
**Parameters:**
* value ([str](https://docs.python.org/3/library/stdtypes.html#str) | [TensorValue](TensorValue.md#max.graph.TensorValue)) – The value to print. Can be either a string or a TensorValue.
* label ([str](https://docs.python.org/3/library/stdtypes.html#str)) – A label to identify the printed value. Defaults to
`debug_tensor`.
**Return type:**
None
### `qmatmul()` {#max.graph.ops.qmatmul}
> max.graph.ops.qmatmul(encoding, config, lhs, \*rhs)
Performs matrix multiplication between floating point and quantized
tensors.
This quantizes the `lhs` floating point value to match the encoding of the
`rhs` quantized value, performs matmul, and then dequantizes the result.
Beware that, compared to a regular matmul op, this one expects the `rhs`
value to be transposed. For example, if the `lhs` shape is `[32, 64]`, and
the quantized `rhs` shape is also `[32, 64]`, then the output shape is
`[32, 32]`.
That is, this function returns the result from:
> dequantize(quantize(lhs) @ transpose(rhs))
The last two dimensions in `lhs` are treated as matrices and multiplied
by `rhs` (which must be a 2D tensor). Any remaining dimensions in `lhs`
are broadcast dimensions.
NOTE: Currently this supports Q4\_0, Q4\_K, and Q6\_K encodings only.
**Parameters:**
* encoding ([QuantizationEncoding](quantization.md#max.graph.quantization.QuantizationEncoding)) – The quantization encoding to use.
* lhs ([TensorValue](TensorValue.md#max.graph.TensorValue)) – The non-quantized, left-hand-side of the matmul.
* \*rhs ([TensorValue](TensorValue.md#max.graph.TensorValue)) – The transposed and quantized right-hand-side of the matmul and
auxiliary tensor (if any). Must be rank 2 and in a supported
[quantization encoding](/max/api/mojo/graph/quantization/).
* config ([QuantizationConfig](quantization.md#max.graph.quantization.QuantizationConfig) | None)
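A shape-level sketch in plain NumPy (ignoring quantization entirely) of the transposed-rhs convention above:

```python
import numpy as np

# Shape-level sketch of the transposed-rhs convention above; the actual op
# quantizes lhs, multiplies against the quantized rhs, and dequantizes.
lhs = np.ones((32, 64))
rhs = np.ones((32, 64))  # stands in for the quantized, transposed rhs
result = lhs @ rhs.T
print(result.shape)  # (32, 32)
```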
### `range()` {#max.graph.ops.range}
> max.graph.ops.range(start, stop, step=1, out\_dim=None, \*, dtype, device)
Creates a sequence of numbers. The sequence goes from start with
increments of size step up to (but not including) stop. start, stop,
and step must have the same element type.
Note the following restrictions on input values:
1. step must be non-zero
2. stop - start must be zero or have the same sign as step
**Parameters:**
* start (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The start of the range to generate.
* stop (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The range will be generated up to, but not including, this value.
* step (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The step size for the range.
* out\_dim ([int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | None) – The expected output dimensions returned by the range op.
These will be asserted to be correct at graph execution time.
* device ([Device](../driver.md#max.driver.Device) | [DeviceRef](type.md#max.graph.type.DeviceRef)) – Device of the result tensor.
* dtype ([DType](../dtype.md#max.dtype.DType)) – Data type of the result tensor. If not specified, defaults to
float32 for numeric inputs or infers from tensor inputs.
**Returns:**
A symbolic tensor value containing the defined range of values.
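The length rule mirrors `numpy.arange`, generalized to floats. A plain-Python sketch of how many elements the op produces under the restrictions above (the function name is illustrative, not part of the MAX API):

```python
import math

def range_length(start: float, stop: float, step: float) -> int:
    """Number of elements a range from start to stop (exclusive) with
    the given step would contain, mirroring numpy.arange semantics."""
    if step == 0:
        raise ValueError("step must be non-zero")
    span = stop - start
    if span != 0 and (span > 0) != (step > 0):
        raise ValueError("stop - start must be zero or share the sign of step")
    return max(0, math.ceil(span / step))

print(range_length(0, 10, 2))       # 5 elements: 0, 2, 4, 6, 8
print(range_length(1.0, 2.0, 0.25)) # 4 elements: 1.0, 1.25, 1.5, 1.75
```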
### `rebind()` {#max.graph.ops.rebind}
> max.graph.ops.rebind(x, shape, message='', layout=None)
Rebinds a symbolic tensor to a specified set of dimensions.
This does not mutate the symbolic tensor passed in, but instead adds a
runtime assertion that the input's symbolic shape is equivalent to the
given `shape`. For example, if the input tensor shape has
dynamic/unknown sizes, this will assert the fixed sizes that may be required
for a subsequent operation.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to rebind.
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The symbolic shape to assert for `x`, as a list of
[`Dim`](/max/api/python/graph/type/Dim) values.
* message ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The message printed if the rebind fails at runtime.
* layout ([FilterLayout](type.md#max.graph.type.FilterLayout) | None) – A layout of the weights used by some operations like conv.
**Returns:**
A symbolic tensor with the same elements and shape as the given tensor,
but with the symbolic shape asserted to equal `shape`.
### `relu()` {#max.graph.ops.relu}
> max.graph.ops.relu(x)
Computes the elementwise ReLU (Rectified Linear Unit) of a symbolic tensor.
Creates a new op node to compute the elementwise ReLU of a symbolic tensor
and adds it to the graph, returning the symbolic result. ReLU is defined as
`relu(x) = max(0, x)`, setting all negative values to zero while leaving
positive values unchanged.
ReLU is one of the most common activation functions in neural networks due to
its computational efficiency and effectiveness in addressing the vanishing
gradient problem.
```python
import max.functional as F
from max.tensor import Tensor
## Create input with negative and positive values
x = Tensor.constant([[-2.0, -1.0, 0.0], [1.0, 2.0, 3.0]])
## Apply ReLU activation
result = F.relu(x)
print(result)
## Output: [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]
## Negative values become 0, positive values unchanged
```
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The symbolic tensor to use as the input to the relu computation.
**Returns:**
A new symbolic tensor value representing the output of the relu
value computation.
**Raises:**
Error – If the symbol doesn’t represent a tensor value.
### `repeat_interleave()` {#max.graph.ops.repeat_interleave}
> max.graph.ops.repeat\_interleave(x, repeats, axis=None, out\_dim=None)
Repeats elements of a tensor along the given dimension.
Modeled after `torch.repeat_interleave`.
For example, given `repeats=2` and the following input:
```python
## Input tensor with shape (2, 2)
input = TensorValue(x) # Contains [[1.0, 2.0], [3.0, 4.0]]
```
`repeat_interleave` with `axis=0`:
```python
## Output tensor with shape (4, 2)
output = repeat_interleave(input, repeats=2, axis=0)
## Contains [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0]]
```
`repeat_interleave` with `axis=1`:
```python
## Output tensor with shape (2, 4)
output = repeat_interleave(input, repeats=2, axis=1)
## Contains [[1.0, 1.0, 2.0, 2.0], [3.0, 3.0, 4.0, 4.0]]
```
`repeat_interleave` with `repeats=[2, 3]` and `axis=0`:
```python
repeat_value = TensorValue([2, 3])
## Output tensor with shape (5, 2)
output = repeat_interleave(input, repeats=repeat_value, axis=0)
## Contains [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0], [3.0, 4.0], [3.0, 4.0]]
```
`repeat_interleave` with `axis=None` (the default), which flattens the input first:
```python
## Output tensor with shape (8,)
output = repeat_interleave(input, repeats=2) # axis = None
## Contains [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
```
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor.
* repeats ([int](https://docs.python.org/3/library/functions.html#int) | [TensorValue](TensorValue.md#max.graph.TensorValue)) – The number of repetitions for each element.
* axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The dimension along which to repeat values. If axis is not
specified or None (the default), flatten the input array
and repeat the flattened values.
* out\_dim ([int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | None)
**Returns:**
A symbolic tensor with the elements interleaved.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If `repeats` is non-positive or if `axis` is out of range.
### `reshape()` {#max.graph.ops.reshape}
> max.graph.ops.reshape(x, shape)
Reshapes a symbolic tensor.
The number and order of the elements in the tensor is unchanged.
In other words, if you were to iterate over elements in the tensor
by major dimension to minor dimension, the iteration order would stay
the same.
If a value of -1 is present in the shape, that dimension becomes an
automatically calculated dimension collecting all unspecified dimensions.
Its length becomes the number of elements in the original tensor
divided by the product of the other dimensions in the new shape.
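The -1 inference rule can be sketched in plain Python (the helper name is illustrative, not part of the MAX API):

```python
from math import prod

def resolve_reshape(num_elements: int, shape: list) -> list:
    """Resolve a single -1 in `shape` so the element counts match."""
    if shape.count(-1) > 1:
        raise ValueError("at most one dimension may be -1")
    known = prod(d for d in shape if d != -1)
    if -1 in shape:
        if num_elements % known != 0:
            raise ValueError("element counts do not match")
        return [num_elements // known if d == -1 else d for d in shape]
    if known != num_elements:
        raise ValueError("element counts do not match")
    return shape

print(resolve_reshape(24, [2, -1, 3]))  # [2, 4, 3]
```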
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to reshape.
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The new shape as a list of dimensions.
A single dimension may be -1.
**Returns:**
A symbolic tensor with the same elements as the original tensor, but
in a new shape. Its symbolic shape is the same as `shape`.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the input and target shapes have a different number of elements.
### `resize()` {#max.graph.ops.resize}
> max.graph.ops.resize(input, shape, interpolation=InterpolationMode.BILINEAR)
Resize the input tensor to the given shape.
This function resizes a tensor using the specified interpolation method.
The tensor is expected to have NCHW format (batch, channels, height, width).
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor to resize. Must have rank 4 in NCHW format.
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – Desired output shape of length 4 corresponding to (N, C, H, W).
* interpolation ([InterpolationMode](#max.graph.ops.InterpolationMode)) – Desired interpolation enum defined by InterpolationMode.
Defaults to InterpolationMode.BILINEAR; however, only BICUBIC is
currently supported.
**Returns:**
A resized tensor with the shape specified by the shape argument.
**Raises:**
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the input doesn’t have rank 4, shape has wrong number
of elements, or unsupported interpolation mode is specified.
* [NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) – If single integer size or non-BICUBIC interpolation
mode is specified.
### `scatter()` {#max.graph.ops.scatter}
> max.graph.ops.scatter(input, updates, indices, axis=-1)
Creates a new symbolic tensor where the updates are written to input according to indices.
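The update rule can be sketched in plain Python for the 1-D, `axis=0` case (illustrative only, not the MAX implementation): each entry of `updates` is written into a copy of `input` at the position given by the matching entry of `indices`.

```python
def scatter_1d(input_, updates, indices):
    """Sketch of scatter along axis 0 for a 1-D tensor: write
    updates[i] into position indices[i] of a copy of input_."""
    out = list(input_)
    for idx, upd in zip(indices, updates):
        out[idx] = upd
    return out

print(scatter_1d([0.0, 0.0, 0.0, 0.0], [1.5, 2.5], [3, 1]))
# [0.0, 2.5, 0.0, 1.5]
```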
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to write elements to.
* updates (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – A symbolic tensor of elements to write to input.
* indices (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The positions in input to update.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which indices indexes into.
**Returns:**
A new symbolic tensor representing the result of the scatter operation.
### `scatter_nd()` {#max.graph.ops.scatter_nd}
> max.graph.ops.scatter\_nd(input, updates, indices)
Creates a new symbolic tensor where the updates are scattered into input at specified indices.
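For full indexing on a rank-2 input, each row of `indices` names one coordinate to overwrite. A plain-Python sketch (illustrative only, not the MAX implementation):

```python
def scatter_nd_2d(input_, updates, indices):
    """Sketch of scatter_nd with full indexing on a rank-2 tensor:
    indices[i] is an (row, col) coordinate receiving updates[i]."""
    out = [row[:] for row in input_]  # deep-enough copy for rank 2
    for (r, c), upd in zip(indices, updates):
        out[r][c] = upd
    return out

grid = [[0, 0], [0, 0]]
print(scatter_nd_2d(grid, [7, 9], [[0, 1], [1, 0]]))
# [[0, 7], [9, 0]]
```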
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to write elements to.
* updates (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – A symbolic tensor of elements to write to input.
* indices (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – A tensor of indices specifying where to write updates.
Shape should be \[num\_updates, rank] for full indexing or
\[num\_updates, k] for partial indexing where k < rank.
**Returns:**
A new symbolic tensor representing the result of the scatter\_nd operation.
### `shape_to_tensor()` {#max.graph.ops.shape_to_tensor}
> max.graph.ops.shape\_to\_tensor(shape)
Converts a shape to a tensor.
This is useful for using a shape attribute in an op that expects a tensor
value.
**Parameters:**
shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – the shape attribute of a tensor value.
**Returns:**
The TensorValue containing the same value as shape.
**Example:**
```pycon
>>> import numpy as np
>>> from max.dtype import DType
>>> from max.graph import ops
>>> from max.graph.type import DeviceRef
>>> x = ops.constant(np.zeros((1,)), DType.int64, device=DeviceRef.CPU())
>>> result = ops.stack([
...     x,
...     ops.shape_to_tensor(x.shape),
... ])
>>> result
TensorValue(dtype=int64, shape=[StaticDim(dim=2), StaticDim(dim=1)])
```
### `shard_and_stack()` {#max.graph.ops.shard_and_stack}
> max.graph.ops.shard\_and\_stack(inputs, devices, axis=0)
Shards a list of input tensors along a specified axis, producing multiple outputs.
This operation takes multiple input tensors, splits each along the specified axis
into len(devices) chunks, and returns one output tensor per device. Each output
contains the chunks at the corresponding index stacked from all inputs along
a new dimension 0.
This is useful for distributing model weights across multiple devices in
tensor parallel configurations.
For example, with 2 inputs A and B, axis=0, and 2 devices:
* Input A shape \[10, D], Input B shape \[10, D]
* Output 0: stack(\[A\[0:5], B\[0:5]]) -> shape \[2, 5, D] on devices\[0]
* Output 1: stack(\[A\[5:10], B\[5:10]]) -> shape \[2, 5, D] on devices\[1]
With axis=1 and 2 devices:
* Input A shape \[D, 10], Input B shape \[D, 10]
* Output 0: stack(\[A\[:, 0:5], B\[:, 0:5]]) -> shape \[2, D, 5] on devices\[0]
* Output 1: stack(\[A\[:, 5:10], B\[:, 5:10]]) -> shape \[2, D, 5] on devices\[1]
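The chunking rule for `axis=0` can be sketched in plain Python with nested lists standing in for tensors (illustrative only, not the MAX implementation, and without the device placement):

```python
def shard_rows(inputs, num_devices):
    """Sketch of shard_and_stack for axis=0: split each input into
    num_devices row-chunks, then group chunk i from every input."""
    rows = len(inputs[0])
    if rows % num_devices != 0:
        raise ValueError("axis size must be evenly divisible by num_devices")
    chunk = rows // num_devices
    return [
        [t[i * chunk:(i + 1) * chunk] for t in inputs]  # one output per device
        for i in range(num_devices)
    ]

A = [[1, 1], [2, 2], [3, 3], [4, 4]]  # shape [4, 2]
B = [[5, 5], [6, 6], [7, 7], [8, 8]]  # shape [4, 2]
out = shard_rows([A, B], num_devices=2)
print(out[0])  # [[[1, 1], [2, 2]], [[5, 5], [6, 6]]] - shape [2, 2, 2]
print(out[1])  # [[[3, 3], [4, 4]], [[7, 7], [8, 8]]]
```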
**Parameters:**
* inputs ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)]) – A list of symbolic tensors to shard. All tensors must have
the same shape, dtype, and device.
* devices ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Device](../driver.md#max.driver.Device) | [DeviceRef](type.md#max.graph.type.DeviceRef)]) – Target devices for each output tensor. The number of devices
determines the number of splits. Each output tensor
will be placed on the corresponding device. This enables direct
host-to-device transfer without intermediate CPU storage.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to split each input tensor. Defaults to 0.
Supports negative indexing (e.g., -1 for last axis).
**Returns:**
A list of len(devices) tensors, each with shape
\[num\_inputs, D0, …, Daxis//len(devices), …, Dn-1] where the input
shape is \[D0, …, Daxis, …, Dn-1]. Output i contains the stacked
chunks at position i from all input tensors, placed on devices\[i].
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If inputs list is empty, if devices list is empty, if input
tensors don’t have matching shapes, if the dimension size at the
axis is not evenly divisible by len(devices), or if axis is out of
bounds.
### `sigmoid()` {#max.graph.ops.sigmoid}
> max.graph.ops.sigmoid(x)
Computes the elementwise sigmoid activation of a symbolic tensor.
Creates a new op node to compute the elementwise sigmoid of a symbolic
tensor and adds it to the graph, returning the symbolic result. Sigmoid
is defined as `sigmoid(x) = 1 / (1 + exp(-x))`, mapping all input values
to the range (0, 1).
The sigmoid function is commonly used for binary classification tasks and
as an activation function in neural networks, particularly in output layers
for probability prediction.
```python
import max.functional as F
from max.tensor import Tensor
## Create input tensor
x = Tensor.constant([[-2.0, -1.0, 0.0], [1.0, 2.0, 3.0]])
## Apply sigmoid activation
result = F.sigmoid(x)
print(result)
## Output: [[0.119, 0.269, 0.5], [0.731, 0.881, 0.953]]
## All values mapped to range (0, 1)
```
**Parameters:**
* x ([TensorValue](TensorValue.md#max.graph.TensorValue)) – The symbolic tensor to use as the input to the sigmoid computation.
**Returns:**
A new symbolic tensor value representing the output of the sigmoid
value computation.
**Raises:**
Error – If the symbol doesn’t represent a tensor value.
### `silu()` {#max.graph.ops.silu}
> max.graph.ops.silu(x)
Computes the elementwise silu of a symbolic tensor.
Creates a new op node to compute the elementwise silu of a
symbolic tensor and adds it to the graph, returning the symbolic result.
`silu` is defined as `silu(x) = x * sigmoid(x)`.
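The scalar math behind the op is easy to check in plain Python (illustrative only, not the MAX implementation):

```python
import math

def silu(x: float) -> float:
    """silu(x) = x * sigmoid(x), where sigmoid(x) = 1 / (1 + exp(-x))."""
    return x * (1.0 / (1.0 + math.exp(-x)))

print(round(silu(0.0), 3))  # 0.0
print(round(silu(1.0), 3))  # 0.731
```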
**Parameters:**
* x ([TensorValue](TensorValue.md#max.graph.TensorValue)) – The symbolic tensor to use as the input to the silu computation.
**Returns:**
A new symbolic tensor value representing the output of the silu
value computation.
**Raises:**
Error – If the symbol doesn’t represent a tensor value.
### `slice_tensor()` {#max.graph.ops.slice_tensor}
> max.graph.ops.slice\_tensor(x, indices)
Slices out a subtensor view of the input tensor based on indices.
The semantics of [`slice_tensor()`](#max.graph.ops.slice_tensor) follow NumPy slicing semantics with the
following restrictions:
* Slice indices must not index out of `[-dim - 1, dim - 1]` for negative step,
or `[-dim, dim]` for positive step.
```python
## Reverse a tensor.
slice_tensor(x, [slice(None, None, -1)])
## Unsqueeze the second last dimension of a tensor.
slice_tensor(x, [..., None, slice(None)])
```
**Returns:**
The sliced subtensor of x.
**Parameters:**
* x ([TensorValue](TensorValue.md#max.graph.TensorValue))
* indices (SliceIndices)
### `split()` {#max.graph.ops.split}
> max.graph.ops.split(x, split\_sizes, axis=0)
Splits the input tensor into multiple tensors along a given dimension.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to split.
* split\_sizes ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – Sizes of each output tensor. Must sum to the size of the split
dimension `axis`.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – Dimension to split the input tensor. Must have a statically
known dimension size.
**Returns:**
A list of tensors with the same length as split\_sizes, where each
tensor has the same shape as the input except along the split dimension
axis, where the size is given by the corresponding element in
split\_sizes.
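The slicing rule can be sketched in plain Python for a 1-D input on axis 0 (illustrative only, not the MAX implementation):

```python
def split_1d(x, split_sizes):
    """Sketch of split on axis 0: cut a 1-D sequence into consecutive
    pieces whose lengths are given by split_sizes."""
    if sum(split_sizes) != len(x):
        raise ValueError("split_sizes must sum to the axis dimension")
    out, offset = [], 0
    for size in split_sizes:
        out.append(x[offset:offset + size])
        offset += size
    return out

print(split_1d([10, 20, 30, 40, 50], [2, 3]))  # [[10, 20], [30, 40, 50]]
```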
### `sqrt()` {#max.graph.ops.sqrt}
> max.graph.ops.sqrt(x)
Computes the elementwise square root of a symbolic tensor.
Creates a new op node to compute the elementwise square root of a symbolic
tensor and adds it to the graph, returning the symbolic result. Square root
is commonly used in normalization operations, distance calculations, and
implementing mathematical operations like standard deviation.
```python
import max.functional as F
from max.tensor import Tensor
## Create tensor with positive values
x = Tensor.constant([1.0, 4.0, 9.0, 16.0])
## Compute square root
result = F.sqrt(x)
print(result)
## Output: [1.0, 2.0, 3.0, 4.0]
## Note: sqrt requires non-negative values
## For tensors with negative values, use abs first:
y = Tensor.constant([1.0, -4.0, 9.0, -16.0])
result2 = F.sqrt(F.abs(y))
print(result2)
## Output: [1.0, 2.0, 3.0, 4.0]
```
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The symbolic tensor to use as the input to the sqrt
computation. If it’s not a floating-point DType, an exception will be raised.
**Returns:**
A new symbolic tensor value representing the output of the sqrt
value computation.
**Raises:**
Error – If the symbol doesn’t represent a tensor value.
### `squeeze()` {#max.graph.ops.squeeze}
> max.graph.ops.squeeze(x, axis)
Removes a size-1 dimension from a symbolic tensor.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to squeeze.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The dimension to remove from the input’s shape. If negative, this
indexes from the end of the tensor. For example,
`squeeze(v, -1)` squeezes the last dimension.
**Returns:**
A symbolic tensor with the same number of elements as the input tensor,
and whose rank is 1 less than the rank of the input tensor.
### `stack()` {#max.graph.ops.stack}
> max.graph.ops.stack(values, axis=0)
Stacks a list of tensors along a new axis.
**Parameters:**
* values ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)]) – A list of symbolic tensor values. Each tensor must have the same
dtype and rank, and must have the same dimension size for each
dimension.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis to concatenate along. If negative, indexes relative
to the end of the tensor shape plus 1. For instance,
`stack(vs, -1)` will create and stack along a new axis as the
last dimension, and `stack(vs, -2)` will create and stack along a new
dimension which is inserted immediately before the last dimension.
**Returns:**
A new symbolic tensor representing the result of the stack. It will
have rank `n+1` where `n` is the rank of each input tensor. Its size
on each dimension other than `axis` will be the same as each input tensor’s,
with the new axis inserted. Along the new dimension it will have size
`len(values)`.
### `sum()` {#max.graph.ops.sum}
> max.graph.ops.sum(x, axis=-1)
Reduces a symbolic tensor using a sum operation.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor for the operation.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis along which to compute the reduction. If negative,
indexes from the last dimension. For example, a value of -1 will
compute the reduction along the last dimension.
**Returns:**
A symbolic tensor representing the result of the sum operation.
The tensor will have the same rank as the input tensor, and the same
shape except along the `axis` dimension which will have size 1.
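The rank-preserving (size-1) reduction can be sketched in plain Python for a 2-D input represented as nested lists (illustrative only, not the MAX implementation):

```python
def sum_keepdim(x, axis):
    """Sum a 2-D nested list along axis, keeping a size-1 dimension
    so the result has the same rank as the input."""
    if axis in (-1, 1):
        return [[sum(row)] for row in x]          # reduce along columns
    return [[sum(col) for col in zip(*x)]]        # axis 0: reduce along rows

x = [[1, 2, 3], [4, 5, 6]]
print(sum_keepdim(x, axis=-1))  # [[6], [15]]  - shape (2, 1)
print(sum_keepdim(x, axis=0))   # [[5, 7, 9]]  - shape (1, 3)
```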
### `tanh()` {#max.graph.ops.tanh}
> max.graph.ops.tanh(x)
Computes the elementwise tanh (hyperbolic tangent) of a symbolic tensor.
Creates a new op node to compute the elementwise tanh of a symbolic tensor
and adds it to the graph, returning the symbolic result. Tanh is defined as
`tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))`, mapping all input
values to the range (-1, 1).
The tanh function is commonly used as an activation function in recurrent
neural networks (RNNs) and as a hidden layer activation in feedforward networks.
Unlike sigmoid which maps to (0, 1), tanh is zero-centered, which can help
with gradient flow during training.
```python
import max.functional as F
from max.tensor import Tensor
## Create input tensor
x = Tensor.constant([[-2.0, -1.0, 0.0], [1.0, 2.0, 3.0]])
## Apply tanh activation
result = F.tanh(x)
print(result)
## Output: [[-0.964, -0.762, 0.0], [0.762, 0.964, 0.995]]
## All values mapped to range (-1, 1)
```
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The symbolic tensor to use as the input to the tanh
computation. If it’s not a floating-point DType, an exception will be raised.
**Returns:**
A new symbolic tensor value representing the output of the tanh
computation.
**Raises:**
Error – If the symbol doesn’t represent a tensor value.
### `tile()` {#max.graph.ops.tile}
> max.graph.ops.tile(x, repeats)
Returns a new tensor that repeats the input tensor N\_i times
along each dimension i, where N\_i = repeats\[i].
The i-th dimension of the output shape is the i-th dimension of the input
shape multiplied by N\_i.
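The output-shape rule matches NumPy's `np.tile`; a quick illustration (NumPy shown as an analogue, not the MAX API):

```python
import numpy as np

# Each output dimension i is the input dimension times repeats[i].
x = np.array([[1, 2], [3, 4]])
tiled = np.tile(x, (2, 3))
print(tiled.shape)        # (4, 6)
print(tiled[0].tolist())  # [1, 2, 1, 2, 1, 2]
```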
### `top_k()` {#max.graph.ops.top_k}
> max.graph.ops.top\_k(input, k, axis=-1)
Returns a tensor containing only the top k values along the given axis.
**Parameters:**
* input (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input tensor from which to select top k.
* k ([int](https://docs.python.org/3/library/functions.html#int)) – The number of values to select from input.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The axis from which to select top k.
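A plain-NumPy reference for the value-selection semantics (a sketch of the behavior described above, not the MAX implementation; the helper name `top_k_ref` is hypothetical):

```python
import numpy as np

def top_k_ref(x, k, axis=-1):
    # Keep the k largest values along `axis`, in descending order:
    # sort the negated array ascending, take the first k, negate back.
    return -np.sort(-x, axis=axis).take(range(k), axis=axis)

x = np.array([[3, 1, 4, 1], [5, 9, 2, 6]])
print(top_k_ref(x, 2).tolist())  # [[4, 3], [9, 6]]
```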
### `transfer_to()` {#max.graph.ops.transfer_to}
> max.graph.ops.transfer\_to(x, device)
Device-to-Device transfer operation.
This op transfers the input tensor from its current device to another. A device represents a
computation unit, such as a CPU or GPU. This op is useful when working with
accelerators, for example to move data from one GPU to another, or from a
GPU to the CPU.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue) – The input tensor to transfer.
* device ([Device](../driver.md#max.driver.Device) | [DeviceRef](type.md#max.graph.type.DeviceRef)) – The device to transfer to.
### `transpose()` {#max.graph.ops.transpose}
> max.graph.ops.transpose(x, axis\_1, axis\_2)
Transposes two axes of a symbolic tensor.
For more information, see [`transpose()`](TensorValue.md#max.graph.TensorValue.transpose).
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to transpose.
* axis\_1 ([int](https://docs.python.org/3/library/functions.html#int)) – One of the two axes to transpose. If negative, this indexes
from the end of the tensor. For example,
`transpose(v, -1, -2)` transposes the last two axes.
* axis\_2 ([int](https://docs.python.org/3/library/functions.html#int)) – The other axis to transpose. May also be negative to index from
the end of the tensor.
**Returns:**
A new symbolic tensor with the two specified axes transposed.
It has the same elements and dtype, but the order of the elements
is different according to the transposition.
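The axis-swapping behavior matches NumPy's `np.swapaxes` (shown here only to illustrate the shape rule, not the MAX API):

```python
import numpy as np

# Swapping the last two axes, as in transpose(v, -1, -2).
x = np.arange(24).reshape(2, 3, 4)
print(np.swapaxes(x, -1, -2).shape)  # (2, 4, 3)
```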
### `unsqueeze()` {#max.graph.ops.unsqueeze}
> max.graph.ops.unsqueeze(x, axis)
Inserts a size-1 dimension into a symbolic tensor.
**Parameters:**
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to unsqueeze.
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The index at which to insert a new dimension into the input’s
shape. Elements at that index or higher are shifted back.
If negative, it indexes relative to the rank of the tensor plus 1.
For example, `unsqueeze(v, -1)` adds a new dimension at the
end, and `unsqueeze(v, -2)` inserts the dimension immediately
before the last dimension.
**Returns:**
A symbolic tensor with the same number of elements as the input tensor,
whose rank is 1 larger than the rank of the input tensor. The result’s
shape at the `axis` dimension is a static dimension of size 1.
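NumPy's `np.expand_dims` follows the same negative-axis rule, which makes the behavior easy to see (a NumPy analogue, not the MAX API):

```python
import numpy as np

v = np.zeros((2, 3))
print(np.expand_dims(v, -1).shape)  # (2, 3, 1): new trailing dimension
print(np.expand_dims(v, -2).shape)  # (2, 1, 3): before the last dimension
print(np.expand_dims(v, 0).shape)   # (1, 2, 3): new leading dimension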
### `where()` {#max.graph.ops.where}
> max.graph.ops.where(condition, x, y)
Returns `condition ? x : y` (element-wise), where `condition`, `x`, and `y`
are input tensors.
**Parameters:**
* condition (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – The condition tensor to use for selecting elementwise
values. This tensor must have a boolean dtype.
* x (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – If the condition is true at a position, the value from the same
position in this tensor will be selected.
* y (Value\[TensorType] | [TensorValue](TensorValue.md#max.graph.TensorValue) | [Shape](shape.md#max.graph.shape.Shape) | [Dim](dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../driver.md#max.driver.DLPackArray)) – If the condition is false at a position, the value from the same
position in this tensor will be selected.
**Returns:**
A new symbolic tensor holding values from either `x` or `y`,
based on the elements in `condition`.
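The elementwise selection rule is the same as NumPy's `np.where` (NumPy used as an illustration, not the MAX API):

```python
import numpy as np

cond = np.array([True, False, True])
x = np.array([1, 2, 3])
y = np.array([10, 20, 30])
print(np.where(cond, x, y).tolist())  # [1, 20, 3]
```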
### `while_loop()` {#max.graph.ops.while_loop}
> max.graph.ops.while\_loop(initial\_values, predicate, body)
Execute a loop until the predicate evaluates to false.
Both the predicate and body functions must take as arguments the same
number and types of values as specified in `initial_values`. The predicate
function must return only a boolean scalar tensor of type `DType.bool`.
The body function must return a list of values matching the types of
`initial_values` (or may return a value directly if there is only one).
The following example demonstrates a basic while loop with a single argument:
```python
from max.graph import DeviceRef, Graph, ops
from max.dtype import DType

with Graph("while_loop_example") as g:
    x = ops.constant(0, dtype=DType.int32, device=DeviceRef.CPU())

    def pred(x):
        return x < 10

    def body(x):
        return x + 1

    result = ops.while_loop(x, pred, body)
    print(result)
```
The following example shows a while loop with multiple arguments:
```python
from max.graph import DeviceRef, Graph, ops
from max.dtype import DType

with Graph("while_loop_example") as g:
    x = ops.constant(0, dtype=DType.int32, device=DeviceRef.CPU())
    y = ops.constant(5, dtype=DType.int32, device=DeviceRef.CPU())

    def pred(x, y):
        return ops.logical_and(x < 10, y < 15)

    def body(x, y):
        return [x + 1, y + 1]

    results = ops.while_loop((x, y), pred, body)
    print(results)
```
**Parameters:**
* initial\_values ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]] | [Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – Initial values for loop arguments. Must be non-empty.
* predicate ([Callable](#max.graph.ops.Callable)\[\[...], [TensorValue](TensorValue.md#max.graph.TensorValue)]) – Callable that takes loop arguments and returns a boolean scalar tensor
of type `DType.bool` determining loop continuation.
* body ([Callable](#max.graph.ops.Callable)\[\[...], [Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[Value](Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]) – Callable that takes loop arguments and returns updated values matching
the types of initial\_values.
**Returns:**
List of output values from the final loop iteration.
**Raises:**
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If initial\_values is empty.
* [NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) – If any init\_arg is a `BufferValue`.
:::note Note
Buffer operations are currently not supported.
:::
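The loop semantics can be sketched in plain Python (a reference model of the behavior described above, not the graph op itself; the helper name `while_loop_reference` is hypothetical):

```python
def while_loop_reference(initial_values, predicate, body):
    # Iterate until the predicate returns False, threading the
    # loop-carried values through each call to body.
    values = list(initial_values)
    while predicate(*values):
        result = body(*values)
        values = list(result) if isinstance(result, (list, tuple)) else [result]
    return values

# Mirrors the single-argument example above: count from 0 up to 10.
print(while_loop_reference([0], lambda x: x < 10, lambda x: x + 1))  # [10]
```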
---
## quantization
APIs to quantize graph tensors.
This package includes a comprehensive set of tools for working with quantized
models in MAX Graph. It defines supported quantization encodings, configuration
parameters that control the quantization process, and block parameter
specifications for different quantization formats.
The module supports various quantization formats including 4-bit, 5-bit, and
6-bit precision with different encoding schemes. It also provides support for
GGUF-compatible formats for interoperability with other frameworks.
## `BlockParameters` {#max.graph.quantization.BlockParameters}
> class max.graph.quantization.BlockParameters(elements\_per\_block, block\_size)
Parameters describing the structure of a quantization block.
Block-based quantization stores elements in fixed-size blocks.
Each block contains a specific number of elements in a compressed format.
### `block_size` {#max.graph.quantization.BlockParameters.block_size}
> block\_size: [int](https://docs.python.org/3/library/functions.html#int)
### `elements_per_block` {#max.graph.quantization.BlockParameters.elements_per_block}
> elements\_per\_block: [int](https://docs.python.org/3/library/functions.html#int)
## `QuantizationConfig` {#max.graph.quantization.QuantizationConfig}
> class max.graph.quantization.QuantizationConfig(quant\_method, bits, group\_size, desc\_act=False, sym=False)
Configuration for specifying quantization parameters that affect inference.
These parameters control how tensor values are quantized, including the method,
bit precision, grouping, and other characteristics that affect the trade-off
between model size, inference speed, and accuracy.
### `bits` {#max.graph.quantization.QuantizationConfig.bits}
> bits: [int](https://docs.python.org/3/library/functions.html#int)
### `desc_act` {#max.graph.quantization.QuantizationConfig.desc_act}
> desc\_act: [bool](https://docs.python.org/3/library/functions.html#bool) = False
### `group_size` {#max.graph.quantization.QuantizationConfig.group_size}
> group\_size: [int](https://docs.python.org/3/library/functions.html#int)
### `quant_method` {#max.graph.quantization.QuantizationConfig.quant_method}
> quant\_method: [str](https://docs.python.org/3/library/stdtypes.html#str)
### `sym` {#max.graph.quantization.QuantizationConfig.sym}
> sym: [bool](https://docs.python.org/3/library/functions.html#bool) = False
## `QuantizationEncoding` {#max.graph.quantization.QuantizationEncoding}
> class max.graph.quantization.QuantizationEncoding(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
Quantization encodings supported by MAX Graph.
Quantization reduces the precision of neural network weights to decrease
memory usage and potentially improve inference speed. Each encoding represents
a different compression method with specific trade-offs between model size,
accuracy, and computational efficiency.
These encodings are commonly used with pre-quantized model checkpoints
(especially GGUF format) or can be applied during weight allocation.
The following example shows how to create a quantized weight using the Q4\_K encoding:
```python
from max.dtype import DType
from max.graph import DeviceRef, Weight
from max.graph.quantization import QuantizationEncoding

encoding = QuantizationEncoding.Q4_K
quantized_weight = Weight(
    name="linear.weight",
    dtype=DType.uint8,
    shape=[4096, 4096],
    device=DeviceRef.GPU(0),
    quantization_encoding=encoding,
)
```
MAX supports several quantization formats optimized for different use cases.
### `Q4_0` {#max.graph.quantization.QuantizationEncoding.Q4_0}
> Q4\_0
Basic 4-bit quantization with 32 elements per block.
### `Q4_K` {#max.graph.quantization.QuantizationEncoding.Q4_K}
> Q4\_K
4-bit K-quantization with 256 elements per block.
### `Q5_K` {#max.graph.quantization.QuantizationEncoding.Q5_K}
> Q5\_K
5-bit K-quantization with 256 elements per block.
### `Q6_K` {#max.graph.quantization.QuantizationEncoding.Q6_K}
> Q6\_K
6-bit K-quantization with 256 elements per block.
### `GPTQ` {#max.graph.quantization.QuantizationEncoding.GPTQ}
> GPTQ
Group-wise Post-Training Quantization for large language models.
### `block_parameters` {#max.graph.quantization.QuantizationEncoding.block_parameters}
> property block\_parameters: [BlockParameters](#max.graph.quantization.BlockParameters)
Gets the block parameters for this quantization encoding.
**Returns:**
The parameters describing how elements are organized
and encoded in blocks for this quantization encoding.
### `block_size` {#max.graph.quantization.QuantizationEncoding.block_size}
> property block\_size: [int](https://docs.python.org/3/library/functions.html#int)
Number of bytes in encoded representation of block.
All quantization types currently supported by MAX Graph are
block-based: groups of a fixed number of elements are formed, and each
group is quantized together into a fixed-size output block. This value
is the number of bytes resulting after encoding a single block.
### `elements_per_block` {#max.graph.quantization.QuantizationEncoding.elements_per_block}
> property elements\_per\_block: [int](https://docs.python.org/3/library/functions.html#int)
Number of elements per block.
All quantization types currently supported by MAX Graph are
block-based: groups of a fixed number of elements are formed, and each
group is quantized together into a fixed-size output block. This value
is the number of elements gathered into a block.
**Returns:**
Number of original tensor elements in each quantized block.
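Because every supported encoding is block-based, the total encoded size follows directly from these two block parameters. A small arithmetic sketch (the block figures below are hypothetical placeholders for illustration; read the real values from an encoding's `block_parameters` property at runtime):

```python
import math

def quantized_size_bytes(num_elements, elements_per_block, block_size):
    # Whole blocks are formed, so a partial final block still
    # costs a full block of output bytes.
    return math.ceil(num_elements / elements_per_block) * block_size

# Assumed parameters: 256 elements per block, 144 bytes per encoded block.
print(quantized_size_bytes(4096 * 4096, elements_per_block=256, block_size=144))
# 9437184 bytes for a 4096x4096 weight under these assumed parameters
```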
### `is_gguf` {#max.graph.quantization.QuantizationEncoding.is_gguf}
> property is\_gguf: [bool](https://docs.python.org/3/library/functions.html#bool)
Checks if this quantization encoding is compatible with GGUF format.
GGUF is a format for storing large language models and compatible
quantized weights.
**Returns:**
True if this encoding is compatible with GGUF, False otherwise.
### `name` {#max.graph.quantization.QuantizationEncoding.name}
> property name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Gets the lowercase name of the quantization encoding.
**Returns:**
Lowercase string representation of the quantization encoding.
### `parameters` {#max.graph.shape.Shape.parameters}
> property parameters: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[SymbolicDim](dim.md#max.graph.dim.SymbolicDim)]
Lists the symbolic dimension names on which this shape depends.
### `rank` {#max.graph.shape.Shape.rank}
> property rank
### `static_dims` {#max.graph.shape.Shape.static_dims}
> property static\_dims: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
Returns all static dims in the shape as a list of integers.
### `to_mlir()` {#max.graph.shape.Shape.to_mlir}
> to\_mlir()
**Return type:**
ShapeAttr
---
## type
Library for graph value types.
## `BufferType` {#max.graph.type.BufferType}
> class max.graph.type.BufferType(dtype, shape, device)
A symbolic buffer type.
This is a reference to a tensor that can be mutated in place.
### `to_mlir()` {#max.graph.type.ConvInputLayout.to_mlir}
> to\_mlir()
Returns an mlir Attribute representing this Layout.
This attribute is used for certain convolution ops.
**Returns:**
An Attribute representing the layout.
**Return type:**
StringAttr
## `DeviceKind` {#max.graph.type.DeviceKind}
> class max.graph.type.DeviceKind(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
A device type representation.
### `CPU` {#max.graph.type.DeviceKind.CPU}
> CPU = 'cpu'
### `GPU` {#max.graph.type.DeviceKind.GPU}
> GPU = 'gpu'
### `from_string()` {#max.graph.type.DeviceKind.from_string}
> static from\_string(txt)
## `DeviceRef` {#max.graph.type.DeviceRef}
> class max.graph.type.DeviceRef(device\_type, id=0)
A symbolic device representation.
DeviceRef type representation consists of a DeviceKind and an id. This is a direct
representation of the device attribute in mlir.
The following example demonstrates how to create and use device references:
```python
from max.graph import DeviceRef
gpu_device = DeviceRef.GPU()
print(gpu_device) # Outputs: gpu:0
# Create a CPU device with specific id
cpu_device = DeviceRef.CPU(id=1)
print(cpu_device) # Outputs: cpu:1
```
**Parameters:**
* device\_type ([DeviceKind](#max.graph.type.DeviceKind))
* id ([int](https://docs.python.org/3/library/functions.html#int))
### `CPU()` {#max.graph.type.DeviceRef.CPU}
> static CPU(id=0)
Static Method for creating a CPU device.
**Parameters:**
id ([int](https://docs.python.org/3/library/functions.html#int))
**Return type:**
[DeviceRef](#max.graph.type.DeviceRef)
### `GPU()` {#max.graph.type.DeviceRef.GPU}
> static GPU(id=0)
Static Method for creating a GPU device.
**Parameters:**
id ([int](https://docs.python.org/3/library/functions.html#int))
### `from_mlir()` {#max.graph.type.DeviceRef.from_mlir}
> static from\_mlir(attr)
Returns a device from an mlir attribute
**Parameters:**
attr (DeviceRefAttr)
**Return type:**
[DeviceRef](#max.graph.type.DeviceRef)
### `id` {#max.graph.type.DeviceRef.id}
> id: [int](https://docs.python.org/3/library/functions.html#int)
### `is_cpu()` {#max.graph.type.DeviceRef.is_cpu}
> is\_cpu()
Returns true if the device is a CPU device.
### `from_mlir()` {#max.graph.type.FilterLayout.from_mlir}
> static from\_mlir(attr)
**Parameters:**
attr (LayoutAttr) – The MLIR Attribute object to parse into a layout.
**Returns:**
The FilterLayout represented by the Attribute value.
**Return type:**
[FilterLayout](#max.graph.type.FilterLayout)
### `to_mlir()` {#max.graph.type.FilterLayout.to_mlir}
> to\_mlir()
Returns an mlir Attribute representing this Layout.
This attribute is used in tensor type metadata for certain ops.
**Returns:**
An Attribute representing the layout.
**Return type:**
LayoutAttr
## `TensorType` {#max.graph.type.TensorType}
> class max.graph.type.TensorType(dtype, shape, device, \_layout=None)
A symbolic tensor type.
This is not an eager tensor type! This contains no actual data, but
instead represents the type of a value at some point in time during model
execution.
Most internal values in a model will be tensors. This type represents
their element type (`dtype`) and dimensions (`shape`) at a specific point during
model computation. It allows us to do some optimistic optimizations and
shape inference during graph construction, and to provide more detailed
shape information to the compiler for further optimization passes.
The following example shows how to create a tensor type with static dimensions and access its properties:
```python
from max.graph import TensorType
from max.dtype import DType
# Create a tensor type with float32 elements and static dimensions 2x3
tensor_type = TensorType(DType.float32, (2, 3))
print(tensor_type.dtype) # Outputs: DType.float32
print(tensor_type.shape) # Outputs: [2, 3]
```
It can also represent a fully dynamic-rank tensor. The presence of dynamic-rank
tensors in a graph often degrades performance dramatically and
prevents many classes of optimizations.
An optional device (`device`) can also be provided to indicate the explicit
device the tensor is associated with.
### `as_buffer()` {#max.graph.type.TensorType.as_buffer}
> as\_buffer()
Returns the analogous buffer type.
**Return type:**
[BufferType](#max.graph.type.BufferType)
### `from_mlir()` {#max.graph.type.TensorType.from_mlir}
> classmethod from\_mlir(type)
Constructs a tensor type from an MLIR type.
**Parameters:**
* t – The MLIR Type object to parse into a tensor type.
* type (TensorType)
**Returns:**
The tensor type represented by the MLIR Type value.
**Return type:**
[TensorType](#max.graph.type.TensorType)
### `to_mlir()` {#max.graph.type.TensorType.to_mlir}
> to\_mlir()
Converts to an `mlir.Type` instance.
**Returns:**
An `mlir.Type` in the specified Context.
**Return type:**
TensorType
## `Type` {#max.graph.type.Type}
> class max.graph.type.Type
Represents any possible type for Graph values.
Every Value in the Graph has a Type, and that type is represented by a Type instance.
This type may be inspected to get finer-grained types and learn more
about an individual Value.
The following example shows how to work with types in a graph:
```python
from max.graph import Graph, TensorType
from max.dtype import DType

with Graph() as g:
    # Create a tensor type with a specific dtype and shape
    tensor_type = TensorType(DType.float32, [2, 3])

    # The type can be inspected to get information about the value
    print(f"Tensor element type: {tensor_type.dtype}")  # Outputs: DType.float32
    print(f"Tensor shape: {tensor_type.shape}")  # Outputs: [2, 3]
```
### `from_mlir()` {#max.graph.type.Type.from_mlir}
> static from\_mlir(t)
Constructs a type from an MLIR type.
**Parameters:**
t (MlirType) – The MLIR Type object to parse into a type.
### `to_mlir()` {#max.graph.type.Type.to_mlir}
> to\_mlir()
Converts to an `mlir.Type` instance.
**Returns:**
An `mlir.Type` in the specified Context.
**Return type:**
MlirType
---
## weights
Weights are the learned parameters that store a neural network’s knowledge.
They’re multi-dimensional arrays (tensors) of numerical values that determine how
the model transforms inputs into outputs. These weights contain all the
information needed for a model to perform its task, whether that’s text
generation, image classification, or any other capability.
## `GGUFWeights` {#max.graph.weights.GGUFWeights}
> class max.graph.weights.GGUFWeights(source, tensors=None, prefix='', allocated=None)
Implementation for loading weights from GGUF (GPT-Generated Unified Format) files.
`GGUFWeights` provides an interface to load model weights from GGUF files,
which are optimized for quantized large language models. GGUF is the
successor to GGML format and is commonly used in the `llama.cpp` ecosystem
for efficient storage and loading of quantized models.
```python
from pathlib import Path

from max.dtype import DType
from max.graph import DeviceRef
from max.graph.quantization import QuantizationEncoding
from max.graph.weights import GGUFWeights

gguf_path = Path("model-q4_k.gguf")
weights = GGUFWeights(gguf_path)

# Check if a weight exists
if weights.model.layers[0].attention.wq.exists():
    # Allocate quantized attention weight
    wq_weight = weights.model.layers[0].attention.wq.allocate(
        dtype=DType.uint8,  # GGUF quantized weights use uint8
        device=DeviceRef.CPU(),
    )

# Access weight data with quantization info
weight_data = weights.model.layers[0].attention.wq.data()
print(f"Quantization: {weight_data.quantization_encoding}")
print(f"Shape: {weight_data.shape}")

# Allocate with quantization validation
ffn_weight = weights.model.layers[0].feed_forward.w1.allocate(
    quantization_encoding=QuantizationEncoding.Q4_K,
    device=DeviceRef.GPU(0),
)

# Iterate through all weights in a layer
for name, weight in weights.model.layers[0].items():
    if weight.exists():
        print(f"Found weight: {name}")
```
### `allocate()` {#max.graph.weights.GGUFWeights.allocate}
> allocate(dtype=None, shape=None, quantization\_encoding=None, device=cpu:0)
Creates and optionally validates a new Weight.
### `allocated_weights` {#max.graph.weights.GGUFWeights.allocated_weights}
> property allocated\_weights: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [DLPackArray](../driver.md#max.driver.DLPackArray)]
Gets the values of all weights that were allocated previously.
### `data()` {#max.graph.weights.GGUFWeights.data}
> data()
Get weight data with metadata.
```python
weight_data = weights.model.embeddings.weight.data()
print(f"Shape: {weight_data.shape}")
print(f"Dtype: {weight_data.dtype}")
# Convert to different dtype
fp16_data = weight_data.astype(DType.float16)
```
**Returns:**
A WeightData object containing the tensor data along with
metadata like name, dtype, shape, and quantization encoding.
**Raises:**
[KeyError](https://docs.python.org/3/library/exceptions.html#KeyError) – If no weight exists at the current hierarchical name.
**Return type:**
[WeightData](#max.graph.weights.WeightData)
### `exists()` {#max.graph.weights.GGUFWeights.exists}
> exists()
Check if a weight with this exact name exists.
```python
if weights.model.classifier.weight.exists():
    classifier = weights.model.classifier.weight.allocate(...)
else:
    print("Classifier weight not found")
```
**Returns:**
True if a weight with the current hierarchical name exists
in the loaded weights, False otherwise.
### `items()` {#max.graph.weights.GGUFWeights.items}
> items()
Iterate through all allocable weights that start with the prefix.
### `name` {#max.graph.weights.GGUFWeights.name}
> property name: [str](https://docs.python.org/3/library/stdtypes.html#str)
The current weight name or prefix.
## `SafetensorWeights` {#max.graph.weights.SafetensorWeights}
> class max.graph.weights.SafetensorWeights(filepaths, \*, tensors=None, tensors\_to\_file\_idx=None, prefix='', allocated=None, \_st\_weight\_map=None, \_st\_file\_handles=None)
Implementation for loading weights from safetensors files.
SafetensorWeights provides a secure and efficient way to load model weights
from safetensors format files. Safetensors is designed by Hugging Face for
safe serialization that prevents arbitrary code execution and supports
memory-mapped loading for fast access.
```python
from pathlib import Path

from max.dtype import DType
from max.graph import DeviceRef
from max.graph.weights import SafetensorWeights

# Load weights from safetensors files
weight_files = [Path("model.safetensors")]
weights = SafetensorWeights(weight_files)

# Check if a weight exists
if weights.model.embeddings.weight.exists():
    # Allocate the embedding weight
    embedding_weight = weights.model.embeddings.weight.allocate(
        dtype=DType.float32,
        device=DeviceRef.CPU(),
    )

# Access weights with hierarchical naming
attn_weight = weights.transformer.layers[0].attention.weight.allocate(
    dtype=DType.float16
)
```
### `allocate()` {#max.graph.weights.SafetensorWeights.allocate}
> allocate(dtype=None, shape=None, quantization\_encoding=None, device=cpu:0)
Creates a Weight that can be added to a graph.
### `allocate_as_bytes()` {#max.graph.weights.SafetensorWeights.allocate_as_bytes}
> allocate\_as\_bytes(dtype=None)
Create a Weight that can be added to the graph. The weight has a uint8
representation instead of the original data type. The last dimension of
the shape is scaled by the number of bytes it takes to represent the
original data type. For example, \[512, 256] float32 weights become
\[512, 1024] uint8 weights. Scalar weights are interpreted as
weights with shape \[1].
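The reshaping rule can be illustrated with NumPy's zero-copy byte view (a NumPy analogue of the behavior described above, not the MAX API):

```python
import numpy as np

# float32 is 4 bytes per element, so viewing the data as raw bytes
# multiplies the last dimension by 4: [512, 256] -> [512, 1024].
w = np.zeros((512, 256), dtype=np.float32)
print(w.view(np.uint8).shape)  # (512, 1024)
```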
### `allocated_weights` {#max.graph.weights.SafetensorWeights.allocated_weights}
> property allocated\_weights: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [DLPackArray](../driver.md#max.driver.DLPackArray)]
Gets the values of all weights that were allocated previously.
### `data()` {#max.graph.weights.SafetensorWeights.data}
> data()
Get weight data with metadata.
```python
weight_data = weights.model.embeddings.weight.data()
print(f"Shape: {weight_data.shape}")
print(f"Dtype: {weight_data.dtype}")
# Convert to different dtype
fp16_data = weight_data.astype(DType.float16)
```
**Returns:**
A WeightData object containing the tensor data along with
metadata like name, dtype, shape, and quantization encoding.
**Raises:**
[KeyError](https://docs.python.org/3/library/exceptions.html#KeyError) – If no weight exists at the current hierarchical name.
**Return type:**
[WeightData](#max.graph.weights.WeightData)
### `exists()` {#max.graph.weights.SafetensorWeights.exists}
> exists()
Check if a weight with this exact name exists.
```python
if weights.model.classifier.weight.exists():
    classifier = weights.model.classifier.weight.allocate(...)
else:
    print("Classifier weight not found")
```
**Returns:**
True if a weight with the current hierarchical name exists
in the loaded weights, False otherwise.
### `items()` {#max.graph.weights.SafetensorWeights.items}
> items()
Iterate through all allocatable weights that start with the current prefix.
### `name` {#max.graph.weights.SafetensorWeights.name}
> property name: [str](https://docs.python.org/3/library/stdtypes.html#str)
The current weight name or prefix.
## `WeightData` {#max.graph.weights.WeightData}
> class max.graph.weights.WeightData(data, name, dtype, shape, quantization\_encoding=None)
Container for weight tensor data with metadata.
`WeightData` encapsulates a weight tensor along with its metadata,
providing utilities for type conversion and format compatibility.
It supports the DLPack protocol for efficient tensor sharing between
frameworks.
**Parameters:**
* data ([DLPackArray](../driver.md#max.driver.DLPackArray))
* name ([str](https://docs.python.org/3/library/stdtypes.html#str))
* dtype ([DType](../dtype.md#max.dtype.DType))
* shape ([Shape](shape.md#max.graph.shape.Shape))
* quantization\_encoding ([QuantizationEncoding](quantization.md#max.graph.quantization.QuantizationEncoding) | None)
### `astype()` {#max.graph.weights.WeightData.astype}
> astype(dtype)
Convert the weight data to a different dtype.
This method performs actual data conversion, unlike `view()` which
reinterprets the underlying bytes. Special handling is provided for
bfloat16 conversions using PyTorch when available.
```python
# Convert float32 weights to float16 for reduced memory
weight_data = weights.model.layer.weight.data()
fp16_data = weight_data.astype(DType.float16)
```
**Parameters:**
dtype ([DType](../dtype.md#max.dtype.DType)) – Target data type for conversion.
**Returns:**
A new WeightData instance with the converted data.
**Return type:**
[WeightData](#max.graph.weights.WeightData)
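The difference between converting and reinterpreting can be shown with plain NumPy (illustrative only; `astype()` performs the conversion on `WeightData`, not on raw arrays):
```python
import numpy as np

x = np.array([1.5, -2.0], dtype=np.float32)

# astype converts values: each float32 is rounded to the nearest
# float16, so the numbers are preserved (up to precision).
converted = x.astype(np.float16)

# view reinterprets the same 8 bytes: two float32 values become four
# float16 values whose numeric content is unrelated to the originals.
reinterpreted = x.view(np.float16)
print(converted.tolist())   # [1.5, -2.0]
print(reinterpreted.shape)  # (4,)
```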
### `data` {#max.graph.weights.WeightData.data}
> data: [DLPackArray](../driver.md#max.driver.DLPackArray)
The weight tensor as a DLPack array.
### `dtype` {#max.graph.weights.WeightData.dtype}
> dtype: [DType](../dtype.md#max.dtype.DType)
Data type of the tensor (for example, `DType.float32`, `DType.uint8`).
### `from_numpy()` {#max.graph.weights.WeightData.from_numpy}
> classmethod from\_numpy(arr, name)
Create WeightData from a numpy array.
**Parameters:**
* arr ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – Numpy array containing the weight data.
* name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – Name to assign to this weight.
**Returns:**
A new WeightData instance with dtype and shape inferred
from the numpy array.
**Return type:**
[WeightData](#max.graph.weights.WeightData)
### `name` {#max.graph.weights.WeightData.name}
> name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Hierarchical name of the weight (for example, `"model.layers.0.weight"`).
### `quantization_encoding` {#max.graph.weights.WeightData.quantization_encoding}
> quantization\_encoding: [QuantizationEncoding](quantization.md#max.graph.quantization.QuantizationEncoding) | [None](https://docs.python.org/3/library/constants.html#None) = None
Optional quantization scheme applied to the weight.
### `shape` {#max.graph.weights.WeightData.shape}
> shape: [Shape](shape.md#max.graph.shape.Shape)
Shape of the tensor as a Shape object.
## `Weights` {#max.graph.weights.Weights}
> class max.graph.weights.Weights(\*args, \*\*kwargs)
Protocol for managing and accessing model weights hierarchically.
The Weights protocol provides a convenient interface for loading and organizing
neural network weights. It supports hierarchical naming through attribute and
index access, making it easy to work with complex model architectures.
Weights in MAX are tensors backed by external memory (buffers or memory-mapped
files) that remain separate from the compiled graph.
```python
from max.graph import Graph
from max.dtype import DType

# Create a graph and get its weights interface
graph = Graph("my_model")
weights = graph.weights()

# Allocate weights with hierarchical naming
attn_weight = weights.transformer.layers[0].attention.weight.allocate(
    dtype=DType.float32,
    shape=(768, 768)
)
# Creates weight named "transformer.layers.0.attention.weight"

# Check if a weight exists before allocating
if weights.transformer.layers[0].mlp.weight.exists():
    mlp_weight = weights.transformer.layers[0].mlp.weight.allocate(
        dtype=DType.float16,
        shape=(768, 3072)
    )
```
### `allocate()` {#max.graph.weights.Weights.allocate}
> allocate(dtype=None, shape=None, quantization\_encoding=None, device=cpu:0)
Create a Weight object for this tensor.
```python
# Allocate a weight with specific configuration
weight = weights.model.layers[0].weight.allocate(
    dtype=DType.float16,       # Convert to half precision
    shape=(768, 768),
    device=DeviceRef.GPU(0)    # Place on first GPU
)

# Add to graph
with graph:
    weight_tensor = graph.add_weight(weight)
```
**Parameters:**
* dtype ([DType](../dtype.md#max.dtype.DType) | None) – Data type for the weight. If `None`, uses the original dtype.
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]] | None) – Shape of the weight tensor. If `None`, uses the original shape.
* quantization\_encoding ([QuantizationEncoding](quantization.md#max.graph.quantization.QuantizationEncoding) | None) – Quantization scheme to apply (for example, `Q4_K`, `Q8_0`).
* device ([DeviceRef](type.md#max.graph.type.DeviceRef)) – Target device for the weight (CPU or GPU).
**Returns:**
A Weight object that can be added to a graph using
`graph.add_weight()`.
**Return type:**
[Weight](Weight.md#max.graph.Weight)
### `allocated_weights` {#max.graph.weights.Weights.allocated_weights}
> property allocated\_weights: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [DLPackArray](../driver.md#max.driver.DLPackArray)]
Get all previously allocated weights. This only includes weights that were
explicitly allocated using the [`allocate()`](#max.graph.weights.Weights.allocate) method, not all available weights.
**Returns:**
A dictionary mapping weight names to their numpy arrays for
all weights that have been allocated through this interface.
### `data()` {#max.graph.weights.Weights.data}
> data()
Get weight data with metadata.
```python
weight_data = weights.model.embeddings.weight.data()
print(f"Shape: {weight_data.shape}")
print(f"Dtype: {weight_data.dtype}")
# Convert to different dtype
fp16_data = weight_data.astype(DType.float16)
```
**Returns:**
A WeightData object containing the tensor data along with
metadata like name, dtype, shape, and quantization encoding.
**Raises:**
[KeyError](https://docs.python.org/3/library/exceptions.html#KeyError) – If no weight exists at the current hierarchical name.
**Return type:**
[WeightData](#max.graph.weights.WeightData)
### `exists()` {#max.graph.weights.Weights.exists}
> exists()
Check if a weight with this exact name exists.
```python
if weights.model.classifier.weight.exists():
    classifier = weights.model.classifier.weight.allocate(...)
else:
    print("Classifier weight not found")
```
**Returns:**
True if a weight with the current hierarchical name exists
in the loaded weights, False otherwise.
### `items()` {#max.graph.weights.Weights.items}
> items()
Iterate through all weights that start with the current prefix.
```python
# Iterate through all weights in a specific layer
for name, weight in weights.transformer.layers[0].items():
    print(f"Found weight: {name}")
```
**Yields:**
Tuples of (name, weight\_accessor) for each weight under the
current prefix. The name is relative to the current prefix.
### `name` {#max.graph.weights.Weights.name}
> property name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Get the current weight name or prefix.
**Returns:**
The hierarchical name built from attribute and index access.
For example, if accessed as `weights.model.layers[0]`,
returns `model.layers.0`.
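The prefix-building behavior can be mimicked with a small mock class (illustrative only; this is not the real `Weights` implementation):
```python
class MockWeights:
    """Toy stand-in that builds hierarchical weight names the way
    the Weights protocol does, via attribute and index access."""

    def __init__(self, prefix: str = "") -> None:
        self._prefix = prefix

    def __getattr__(self, attr: str) -> "MockWeights":
        # weights.model -> prefix "model"
        sep = "." if self._prefix else ""
        return MockWeights(f"{self._prefix}{sep}{attr}")

    def __getitem__(self, idx: int) -> "MockWeights":
        # layers[0] -> prefix "layers.0"
        return MockWeights(f"{self._prefix}.{idx}")

    @property
    def name(self) -> str:
        return self._prefix


weights = MockWeights()
print(weights.model.layers[0].name)  # model.layers.0
```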
## `WeightsFormat` {#max.graph.weights.WeightsFormat}
> class max.graph.weights.WeightsFormat(value, names=None, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
Enumeration of supported weight file formats.
MAX supports multiple weight formats to accommodate different model sources
and use cases.
### `gguf` {#max.graph.weights.WeightsFormat.gguf}
> gguf = 'gguf'
GGUF (GPT-Generated Unified Format) for quantized models.
File extension: `.gguf`
Optimized for quantized large language models, particularly those from the
llama.cpp ecosystem. Supports multiple quantization schemes (`Q4_K`,
`Q5_K`, `Q8_0`, etc.) and includes model metadata in the file.
### `safetensors` {#max.graph.weights.WeightsFormat.safetensors}
> safetensors = 'safetensors'
Safetensors format for secure and efficient tensor storage.
File extension: `.safetensors`
Designed by Hugging Face for safe serialization that prevents
arbitrary code execution. Uses memory-mapped files for fast loading
and supports sharding across multiple files.
## `load_weights()` {#max.graph.weights.load_weights}
> max.graph.weights.load\_weights(paths)
Loads neural network weights from checkpoint files.
Automatically detects checkpoint formats based on file extensions and returns
the appropriate Weights implementation, creating a seamless interface for
loading weights from different formats.
Supported formats:
* Safetensors: .safetensors
* PyTorch: .bin, .pt, .pth
* GGUF: .gguf
The following example shows how to load weights from a Safetensors file:
```python
from pathlib import Path
from max.graph.weights import load_weights

# Load multi-file checkpoints
sharded_paths = [
    Path("model-00001-of-00003.safetensors"),
    Path("model-00002-of-00003.safetensors"),
    Path("model-00003-of-00003.safetensors")
]
weights = load_weights(sharded_paths)

layer_weight = weights.model.layers[23].mlp.gate_proj.weight.allocate(
    dtype=DType.float32,
    shape=[4096, 14336],
    device=DeviceRef.GPU(0)
)
```
**Parameters:**
paths ([list](https://docs.python.org/3/library/stdtypes.html#list)\[Path]) – List of pathlib.Path objects pointing to checkpoint files.
For multi-file checkpoints (e.g., sharded Safetensors), provide
all file paths in the list. For single-file checkpoints, provide
a list with one path.
**Return type:**
[Weights](#max.graph.weights.Weights)
## `weights_format()` {#max.graph.weights.weights_format}
> max.graph.weights.weights\_format(weight\_paths)
Detect the format of weight files based on their extensions.
This function examines the file extensions of all provided paths to
determine the weight format. All files must have the same format;
mixed formats are not supported.
```python
from pathlib import Path
# Detect format for safetensor files
paths = [Path("model-00001.safetensors"), Path("model-00002.safetensors")]
format = weights_format(paths)
print(format) # WeightsFormat.safetensors
```
**Parameters:**
weight\_paths ([list](https://docs.python.org/3/library/stdtypes.html#list)\[Path]) – List of file paths containing model weights. All files
must have the same extension/format.
**Returns:**
The detected WeightsFormat enum value.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If weight\_paths is empty, contains mixed formats, or
has unsupported file extensions.
**Return type:**
[WeightsFormat](#max.graph.weights.WeightsFormat)
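The detection logic amounts to mapping file extensions to formats and rejecting empty or mixed sets. A simplified sketch of that behavior (not the actual implementation; the extension table is an assumption based on the formats listed under `load_weights()`):
```python
from enum import Enum
from pathlib import Path


class WeightsFormat(Enum):
    gguf = "gguf"
    safetensors = "safetensors"
    pytorch = "pytorch"


# Hypothetical extension table for illustration.
_EXTENSIONS = {
    ".gguf": WeightsFormat.gguf,
    ".safetensors": WeightsFormat.safetensors,
    ".bin": WeightsFormat.pytorch,
    ".pt": WeightsFormat.pytorch,
    ".pth": WeightsFormat.pytorch,
}


def detect_format(paths: list[Path]) -> WeightsFormat:
    if not paths:
        raise ValueError("no weight paths provided")
    formats = {_EXTENSIONS.get(p.suffix) for p in paths}
    if None in formats:
        raise ValueError("unsupported file extension")
    if len(formats) > 1:
        raise ValueError("mixed weight formats are not supported")
    return formats.pop()


paths = [Path("model-00001.safetensors"), Path("model-00002.safetensors")]
print(detect_format(paths))  # WeightsFormat.safetensors
```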
---
## max
The MAX Python API reference.
The MAX API provides a high-performance graph compiler and runtime library that
executes AI models with incredible speed on a wide range of hardware.
MAX offers a layered architecture that lets you work at the level of abstraction
that best fits your needs. From deploying production-ready models with a few
lines of code to building custom neural networks from scratch, each layer builds
upon the others so you can move between levels seamlessly as requirements evolve.
For an introduction, see the
[Model developer guide](/max/develop/).
## Packages and modules
* [`diagnostics.gpu`](/max/api/python/diagnostics/gpu): GPU monitoring and performance diagnostics utilities.
* [`driver`](/max/api/python/driver): Low-level device management and tensor operations.
* [`dtype`](/max/api/python/dtype): Unified data type system supporting various numeric formats.
* [`engine`](/max/api/python/engine): Model execution runtime with automatic optimization.
* [`entrypoints`](/max/api/python/entrypoints): Command-line tools and serving infrastructure.
* [`functional`](/max/api/python/functional): Functional tensor operations (relu, softmax, etc.).
* [`graph`](/max/api/python/graph): Computational graph construction with 100+ operations for complete model control.
* [`interfaces`](/max/api/python/interfaces): Universal interfaces for consistent API integration.
* [`kv_cache`](/max/api/python/kv_cache): KV cache management for efficient attention computation.
* [`nn`](/max/api/python/nn): High-level neural network building blocks with automatic graph compilation.
* [`pipelines`](/max/api/python/pipelines): Pre-built, optimized model architectures for immediate deployment.
* [`profiler`](/max/api/python/profiler): Performance profiling and tracing utilities.
* [`random`](/max/api/python/random): Random tensor generation utilities.
* [`tensor`](/max/api/python/tensor): Tensor class with eager execution.
* [`torch`](/max/api/python/torch): PyTorch integration for custom operations and interoperability.
---
## interfaces
Universal interfaces between all aspects of the MAX Inference Stack.
## `AudioGenerationInputs` {#max.interfaces.AudioGenerationInputs}
> class max.interfaces.AudioGenerationInputs(batch)
Input data structure for audio generation pipelines.
This class represents the input data required for audio generation operations
within the pipeline framework. It extends PipelineInputs and provides type-safe
generic support for different audio generation context types.
### `batch` {#max.interfaces.AudioGenerationInputs.batch}
> batch: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[RequestID](#max.interfaces.RequestID), AudioGenerationContextType]
A dictionary mapping RequestID to AudioGenerationContextType instances.
This batch structure allows for processing multiple audio generation
requests simultaneously while maintaining request-specific context
and configuration data.
## `AudioGenerationMetadata` {#max.interfaces.AudioGenerationMetadata}
> class max.interfaces.AudioGenerationMetadata(\*, sample\_rate=None, duration=None, chunk\_id=None, timestamp=None, final\_chunk=None, model\_name=None, request\_id=None, tokens\_generated=None, processing\_time=None, echo=None)
Represents metadata associated with audio generation.
This class will eventually replace the metadata dictionary used throughout
the AudioGenerationOutput object, providing a structured and type-safe
alternative for audio generation metadata.
**Parameters:**
* sample\_rate ([int](https://docs.python.org/3/library/functions.html#int) | None) – The sample rate of the generated audio in Hz.
* duration ([float](https://docs.python.org/3/library/functions.html#float) | None) – The duration of the generated audio in seconds.
* chunk\_id ([int](https://docs.python.org/3/library/functions.html#int) | None) – Identifier for the audio chunk (useful for streaming).
* timestamp ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Timestamp when the audio was generated.
* final\_chunk ([bool](https://docs.python.org/3/library/functions.html#bool) | None) – Whether this is the final chunk in a streaming sequence.
* model\_name ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Name of the model used for generation.
* request\_id ([RequestID](#max.interfaces.RequestID) | None) – Unique identifier for the generation request.
* tokens\_generated ([int](https://docs.python.org/3/library/functions.html#int) | None) – Number of tokens generated for this audio.
* processing\_time ([float](https://docs.python.org/3/library/functions.html#float) | None) – Time taken to process this audio chunk in seconds.
* echo ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Echo of the input prompt or identifier for verification.
### `tokens_generated` {#max.interfaces.AudioGenerationMetadata.tokens_generated}
> tokens\_generated: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
## `AudioGenerationOutput` {#max.interfaces.AudioGenerationOutput}
> class max.interfaces.AudioGenerationOutput(final\_status, steps\_executed, audio\_data=\<factory>, buffer\_speech\_tokens=None, metadata=\<factory>)
Represents a response from the audio generation API.
This class encapsulates the result of an audio generation request, including
the final status, generated audio data, and optional buffered speech tokens.
## `AudioGenerationRequest` {#max.interfaces.AudioGenerationRequest}
### `audio_prompt_tokens` {#max.interfaces.AudioGenerationRequest.audio_prompt_tokens}
> audio\_prompt\_tokens: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
The prompt speech IDs to use for audio generation.
### `audio_prompt_transcription` {#max.interfaces.AudioGenerationRequest.audio_prompt_transcription}
> audio\_prompt\_transcription: [str](https://docs.python.org/3/library/stdtypes.html#str) = ''
The audio prompt transcription to use for audio generation.
### `buffer_speech_tokens` {#max.interfaces.AudioGenerationRequest.buffer_speech_tokens}
> buffer\_speech\_tokens: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]] | [None](https://docs.python.org/3/library/constants.html#None) = None
An optional tensor containing the last N speech tokens
generated by the model from a previous request.
When this field is specified, this tensor is used to buffer the tokens sent
to the audio decoder.
### `input` {#max.interfaces.AudioGenerationRequest.input}
> input: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None) = None
The text to generate audio for. The maximum length is 4096 characters.
### `model` {#max.interfaces.AudioGenerationRequest.model}
> model: [str](https://docs.python.org/3/library/stdtypes.html#str)
The name of the model to be used for generating audio chunks. This should match
the available models on the server and determines the behavior and
capabilities of the response generation.
### `prompt` {#max.interfaces.AudioGenerationRequest.prompt}
> prompt: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)] | [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None) = None
Optionally provide a preprocessed list of token IDs or a prompt string to pass directly into the model as input.
This replaces automatically generating TokenGeneratorRequestMessages from the input, audio prompt tokens,
and audio prompt transcription fields.
### `sampling_params` {#max.interfaces.AudioGenerationRequest.sampling_params}
> sampling\_params: [SamplingParams](#max.interfaces.SamplingParams)
Request sampling configuration options.
### `streaming` {#max.interfaces.AudioGenerationRequest.streaming}
> streaming: [bool](https://docs.python.org/3/library/functions.html#bool) = True
Whether to stream the audio generation.
## `BaseContext` {#max.interfaces.BaseContext}
> class max.interfaces.BaseContext(\*args, \*\*kwargs)
Core interface for request lifecycle management across all of MAX, including serving, scheduling, and pipelines.
This protocol is intended to provide a unified, minimal contract for request state and status handling throughout the MAX stack.
Each pipeline variant (e.g., text generation, embeddings, image generation) is expected to extend this interface by creating
their own modality-specific context classes that implement this protocol and add additional functionality relevant to their
particular use case.
The minimal interface ensures that all context types can be handled uniformly by the scheduling and serving infrastructure,
while allowing pipeline-specific implementations to add their own state management, input validation, and result handling.
### `is_done` {#max.interfaces.BaseContext.is_done}
> property is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
Whether the request has completed generation.
### `request_id` {#max.interfaces.BaseContext.request_id}
> property request\_id: [RequestID](#max.interfaces.RequestID)
Unique identifier for the request.
### `status` {#max.interfaces.BaseContext.status}
> property status: [GenerationStatus](#max.interfaces.GenerationStatus)
Current generation status of the request.
## `BatchProcessorInputs` {#max.interfaces.BatchProcessorInputs}
> class max.interfaces.BatchProcessorInputs(logits, logit\_offsets, context\_batch)
Arguments for a batch logits processor.
* logits: The model logits, a float32 tensor with shape (N\_batch, vocab\_size).
N\_batch is the number of logits returned by the model for each sequence in the batch.
* logit\_offsets: If the model returns multiple logits, this is a tensor with
shape (batch\_size + 1, 1) that contains the offsets of each sequence in
the batch. Otherwise, this is None.
* context\_batch: The batch of contexts containing the inputs to the model.
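The offsets layout can be illustrated with NumPy (shapes only; this is not the actual processor code):
```python
import numpy as np

vocab_size = 8
# Suppose the model returned 3 logit rows for the first sequence and
# 2 for the second, stacked into one (N_batch, vocab_size) tensor.
logits = np.random.rand(5, vocab_size).astype(np.float32)

# logit_offsets has batch_size + 1 rows; sequence i owns logits rows
# offsets[i] (inclusive) through offsets[i + 1] (exclusive).
logit_offsets = np.array([[0], [3], [5]])

per_sequence = [
    logits[logit_offsets[i, 0]:logit_offsets[i + 1, 0]]
    for i in range(len(logit_offsets) - 1)
]
print([s.shape for s in per_sequence])  # [(3, 8), (2, 8)]
```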
### `context_batch` {#max.interfaces.BatchProcessorInputs.context_batch}
> context\_batch: Sequence\[[TextGenerationContext](#max.interfaces.TextGenerationContext)]
### `logit_offsets` {#max.interfaces.BatchProcessorInputs.logit_offsets}
> logit\_offsets: md.Buffer | [None](https://docs.python.org/3/library/constants.html#None)
### `logits` {#max.interfaces.BatchProcessorInputs.logits}
> logits: md.Buffer
## `BatchType` {#max.interfaces.BatchType}
> class max.interfaces.BatchType(value, names=None, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
Type of batch.
### `CE` {#max.interfaces.BatchType.CE}
> CE = 'CE'
Context encoding batch.
### `TG` {#max.interfaces.BatchType.TG}
> TG = 'TG'
Token generation batch.
## `EmbeddingsContext` {#max.interfaces.EmbeddingsContext}
> class max.interfaces.EmbeddingsContext(\*args, \*\*kwargs)
Protocol defining the interface for embeddings generation contexts.
An `EmbeddingsContext` represents model inputs for embeddings generation pipelines,
managing the state and parameters needed for generating embeddings from input text.
Unlike text generation contexts, this focuses on single-step embedding generation
without iterative token generation concerns.
This protocol includes only the fields necessary for embeddings generation,
excluding text generation specific features like:
* End-of-sequence token handling (eos\_token\_ids)
* Grammar matchers for structured output (matcher)
* JSON schema constraints (json\_schema)
* Log probability tracking (log\_probabilities)
* Token generation iteration state
### `model_name` {#max.interfaces.EmbeddingsContext.model_name}
> property model\_name: [str](https://docs.python.org/3/library/stdtypes.html#str)
The name of the embeddings model to use.
**Returns:**
A string identifying the specific embeddings model for this request.
### `tokens` {#max.interfaces.EmbeddingsContext.tokens}
> property tokens: [TokenBuffer](#max.interfaces.TokenBuffer)
The input tokens to be embedded.
**Returns:**
A NumPy array of token IDs representing the input text to generate
embeddings for.
## `EmbeddingsGenerationInputs` {#max.interfaces.EmbeddingsGenerationInputs}
> class max.interfaces.EmbeddingsGenerationInputs(batches: list\[dict\[RequestID, EmbeddingsContext]])
## `EmbeddingsGenerationOutput` {#max.interfaces.EmbeddingsGenerationOutput}
**Parameters:**
embeddings ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]) – The generated embeddings as a NumPy array.
### `embeddings` {#max.interfaces.EmbeddingsGenerationOutput.embeddings}
> embeddings: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]
The generated embeddings as a NumPy array.
### `is_done` {#max.interfaces.EmbeddingsGenerationOutput.is_done}
> property is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
Indicates whether the embedding generation process is complete.
**Returns:**
Always True, as embedding generation is a single-step operation.
## `GenerationOutput` {#max.interfaces.GenerationOutput}
> class max.interfaces.GenerationOutput(\*, request\_id, final\_status, output)
Output container for image generation pipeline operations.
This class holds a list of generated images in OpenResponses API format,
along with request tracking and status information. It implements the
PipelineOutput protocol by providing the required is\_done property.
Example:
```python
import numpy as np

from max.interfaces.generation import GenerationOutput
from max.interfaces.request import RequestID
from max.interfaces.request.open_responses import OutputImageContent
from max.interfaces.status import GenerationStatus

img_array1 = np.random.rand(512, 512, 3).astype(np.float32)
img_array2 = np.random.rand(512, 512, 3).astype(np.float32)

result = GenerationOutput(
    request_id=RequestID(value="req-123"),
    final_status=GenerationStatus.END_OF_SEQUENCE,
    output=[
        OutputImageContent.from_numpy(img_array1, format="png"),
        OutputImageContent.from_numpy(img_array2, format="jpeg"),
    ]
)

# Or create from URLs
result_from_urls = GenerationOutput(
    request_id=RequestID(value="req-456"),
    final_status=GenerationStatus.END_OF_SEQUENCE,
    output=[
        OutputImageContent(
            type="output_image",
            image_url="https://example.com/image1.png",
            format="png"
        )
    ]
)

# Check if generation is complete
if result.is_done:
    print(f"Generated {len(result.output)} images")
```
### `final_status` {#max.interfaces.GenerationOutput.final_status}
> final\_status: [GenerationStatus](#max.interfaces.GenerationStatus)
The final status of the generation process.
### `is_done` {#max.interfaces.GenerationOutput.is_done}
> property is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
Indicates whether the pipeline operation has completed.
**Returns:**
True if the generation is done (status is not ACTIVE),
False otherwise.
### `output` {#max.interfaces.GenerationOutput.output}
> output: [list](https://docs.python.org/3/library/stdtypes.html#list)\[OutputImageContent]
List of OutputImageContent objects representing generated images.
### `request_id` {#max.interfaces.GenerationOutput.request_id}
> request\_id: [RequestID](#max.interfaces.RequestID)
The unique identifier for the generation request.
## `GenerationStatus` {#max.interfaces.GenerationStatus}
> class max.interfaces.GenerationStatus(value, names=None, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
Enum representing the status of a generation process in the MAX API.
### `ACTIVE` {#max.interfaces.GenerationStatus.ACTIVE}
> ACTIVE = 'active'
The generation process is ongoing.
### `CANCELLED` {#max.interfaces.GenerationStatus.CANCELLED}
> CANCELLED = 'cancelled'
The generation process has been cancelled by the user.
### `END_OF_SEQUENCE` {#max.interfaces.GenerationStatus.END_OF_SEQUENCE}
> END\_OF\_SEQUENCE = 'end\_of\_sequence'
The generation process has reached the end of the sequence.
### `MAXIMUM_LENGTH` {#max.interfaces.GenerationStatus.MAXIMUM_LENGTH}
> MAXIMUM\_LENGTH = 'maximum\_length'
The generation process has reached the maximum allowed length.
### `is_done` {#max.interfaces.GenerationStatus.is_done}
> property is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
Returns True if the generation process is complete (not ACTIVE).
**Returns:**
True if the status is not ACTIVE, indicating completion.
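The `is_done` convenience maps directly onto the enum values; a minimal re-creation of the pattern (a sketch, not the MAX source):
```python
from enum import Enum


class GenerationStatus(Enum):
    """Minimal re-creation of the status enum and its is_done helper."""
    ACTIVE = "active"
    CANCELLED = "cancelled"
    END_OF_SEQUENCE = "end_of_sequence"
    MAXIMUM_LENGTH = "maximum_length"

    @property
    def is_done(self) -> bool:
        # Every terminal status, i.e. anything other than ACTIVE.
        return self is not GenerationStatus.ACTIVE


print(GenerationStatus.ACTIVE.is_done)           # False
print(GenerationStatus.END_OF_SEQUENCE.is_done)  # True
```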
## `ImageContentPart` {#max.interfaces.ImageContentPart}
> class max.interfaces.ImageContentPart(\*, type='image')
**Parameters:**
type ([Literal](https://docs.python.org/3/library/typing.html#typing.Literal)\['image'])
### `model_config` {#max.interfaces.ImageContentPart.model_config}
> model\_config: ClassVar\[ConfigDict] = {'frozen': True}
Configuration for the model; should be a dictionary conforming to pydantic's `ConfigDict`.
### `type` {#max.interfaces.ImageContentPart.type}
> type: Literal\['image']
## `ImageMetadata` {#max.interfaces.ImageMetadata}
> class max.interfaces.ImageMetadata(\*, start\_idx, end\_idx, pixel\_values, image\_hash=None)
Metadata about an image in the prompt.
Each image corresponds to a range in the text token array \[start\_idx, end\_idx).
### `end_idx` {#max.interfaces.ImageMetadata.end_idx}
> end\_idx: [int](https://docs.python.org/3/library/functions.html#int)
One past the index of the last image special token for the image
### `image_hash` {#max.interfaces.ImageMetadata.image_hash}
> image\_hash: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
Hash of the image, for use in prefix caching
### `pixel_values` {#max.interfaces.ImageMetadata.pixel_values}
> pixel\_values: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]
Pixel values for the image.
Can be various dtypes depending on the vision model:
* float32: Original precision
* uint16: BFloat16 bits stored as uint16 (workaround for NumPy’s lack of
native bfloat16 support). Reinterpreted as bfloat16 on GPU.
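The uint16 workaround relies on bfloat16 being the upper 16 bits of a float32, so the round trip can be sketched with NumPy (illustrative only):
```python
import numpy as np

values = np.array([1.0, -2.5], dtype=np.float32)

# Truncate each float32 to its top 16 bits: that bit pattern is the
# bfloat16 encoding, storable in a plain uint16 array.
bf16_bits = (values.view(np.uint32) >> 16).astype(np.uint16)

# The consumer reinterprets: shift the bits back into the high half
# of a float32 to recover the (possibly rounded) value.
restored = (bf16_bits.astype(np.uint32) << 16).view(np.float32)
print(restored.tolist())  # [1.0, -2.5]
```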
### `start_idx` {#max.interfaces.ImageMetadata.start_idx}
> start\_idx: [int](https://docs.python.org/3/library/functions.html#int)
Index of the first image special token for the image
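The half-open `[start_idx, end_idx)` convention means an image's token span is an ordinary Python slice. A small sketch with made-up token IDs:
```python
import numpy as np

# Hypothetical token array: positions 2..4 hold the image's special
# tokens (ids here are invented for illustration).
tokens = np.array([101, 7592, 900, 900, 900, 2088, 102])
start_idx, end_idx = 2, 5  # end_idx is one past the last image token

image_span = tokens[start_idx:end_idx]
print(image_span.tolist())  # [900, 900, 900]
print(end_idx - start_idx)  # 3 tokens belong to this image
```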
## `LoRAOperation` {#max.interfaces.LoRAOperation}
> class max.interfaces.LoRAOperation(value, names=None, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
Enum for different LoRA operations.
### `LOAD` {#max.interfaces.LoRAOperation.LOAD}
> LOAD = 'load'
### `UNLOAD` {#max.interfaces.LoRAOperation.UNLOAD}
> UNLOAD = 'unload'
## `LoRARequest` {#max.interfaces.LoRARequest}
> class max.interfaces.LoRARequest(operation, lora\_name, lora\_path=None)
Container for LoRA adapter requests.
## `LoRAResponse` {#max.interfaces.LoRAResponse}
> class max.interfaces.LoRAResponse(status, message)
Response to a LoRA adapter request.
**Parameters:**
* status ([LoRAStatus](#max.interfaces.LoRAStatus))
* message ([str](https://docs.python.org/3/library/stdtypes.html#str) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)])
### `message` {#max.interfaces.LoRAResponse.message}
> message: [str](https://docs.python.org/3/library/stdtypes.html#str) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]
### `status` {#max.interfaces.LoRAResponse.status}
> status: [LoRAStatus](#max.interfaces.LoRAStatus)
## `LoRAStatus` {#max.interfaces.LoRAStatus}
> class max.interfaces.LoRAStatus(value, names=&lt;not given&gt;, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
Enum for LoRA operation status.
### `LOAD_ERROR` {#max.interfaces.LoRAStatus.LOAD_ERROR}
> LOAD\_ERROR = 'load\_error'
### `LOAD_INVALID_ADAPTER` {#max.interfaces.LoRAStatus.LOAD_INVALID_ADAPTER}
> LOAD\_INVALID\_ADAPTER = 'load\_invalid\_adapter'
### `LOAD_INVALID_PATH` {#max.interfaces.LoRAStatus.LOAD_INVALID_PATH}
> LOAD\_INVALID\_PATH = 'load\_invalid\_path'
### `LOAD_NAME_EXISTS` {#max.interfaces.LoRAStatus.LOAD_NAME_EXISTS}
> LOAD\_NAME\_EXISTS = 'load\_name\_exists'
### `SUCCESS` {#max.interfaces.LoRAStatus.SUCCESS}
> SUCCESS = 'success'
### `UNLOAD_ERROR` {#max.interfaces.LoRAStatus.UNLOAD_ERROR}
> UNLOAD\_ERROR = 'unload\_error'
### `UNLOAD_NAME_NONEXISTENT` {#max.interfaces.LoRAStatus.UNLOAD_NAME_NONEXISTENT}
> UNLOAD\_NAME\_NONEXISTENT = 'unload\_name\_nonexistent'
### `UNSPECIFIED_ERROR` {#max.interfaces.LoRAStatus.UNSPECIFIED_ERROR}
> UNSPECIFIED\_ERROR = 'unspecified\_error'
## `LoRAType` {#max.interfaces.LoRAType}
> class max.interfaces.LoRAType(value, names=&lt;not given&gt;, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
Enumeration for LoRA Types.
### `A` {#max.interfaces.LoRAType.A}
> A = 'lora\_A'
Represents the LoRA A matrix (high-rank tensor to low-rank tensor).
### `B` {#max.interfaces.LoRAType.B}
> B = 'lora\_B'
Represents the LoRA B matrix (low-rank tensor to high-rank tensor).
### `BIAS` {#max.interfaces.LoRAType.BIAS}
> BIAS = 'lora.bias'
Represents the LoRA bias matrix (added to matrix B).
### `B_KV` {#max.interfaces.LoRAType.B_KV}
> B\_KV = 'lora\_B\_kv'
Represents the combined K and V LoRA B matrices for QKV fusion.
## `LogProbabilities` {#max.interfaces.LogProbabilities}
> class max.interfaces.LogProbabilities(token\_log\_probabilities, top\_log\_probabilities)
Log probabilities for an individual output token.
This is a data-only class that serves as a serializable data structure for
transferring log probability information. It does not provide any functionality
for calculating or manipulating log probabilities - it is purely for data storage
and serialization purposes.
### `token_log_probabilities` {#max.interfaces.LogProbabilities.token_log_probabilities}
> token\_log\_probabilities: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[float](https://docs.python.org/3/library/functions.html#float)]
Log probabilities of each output token.
### `top_log_probabilities` {#max.interfaces.LogProbabilities.top_log_probabilities}
> top\_log\_probabilities: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[int](https://docs.python.org/3/library/functions.html#int), [float](https://docs.python.org/3/library/functions.html#float)]]
Top candidate tokens and their corresponding log probabilities.
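The layout described above can be sketched with plain Python data (illustrative values, not output of the MAX class):

```python
import math

# Plain-data sketch of LogProbabilities' layout: one entry per generated
# token, plus a parallel list of per-step maps from candidate token ID
# to log probability.
token_log_probabilities = [math.log(0.5), math.log(0.25)]
top_log_probabilities = [
    {11: math.log(0.5), 7: math.log(0.3)},    # step 0 candidates
    {42: math.log(0.25), 11: math.log(0.2)},  # step 1 candidates
]

# The lists are parallel, and the sampled token's log prob matches its
# entry in that step's top candidates.
assert len(token_log_probabilities) == len(top_log_probabilities)
assert math.isclose(top_log_probabilities[0][11], token_log_probabilities[0])
```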
## `MAXPullQueue` {#max.interfaces.MAXPullQueue}
> class max.interfaces.MAXPullQueue(\*args, \*\*kwargs)
Protocol for a minimal, non-blocking pull queue interface in MAX.
This protocol defines the contract for a queue that supports non-blocking
get operations for retrieving items. It is generic over the item type and designed
for scenarios where the caller must be immediately notified if no items are available
rather than waiting for items to arrive.
The protocol is intended for consumer-side queue operations where immediate
feedback about queue state is critical for proper flow control and error handling.
### `get_nowait()` {#max.interfaces.MAXPullQueue.get_nowait}
> get\_nowait()
Remove and return an item from the queue without blocking.
This method is expected to raise queue.Empty if no item is available
to retrieve from the queue.
**Returns:**
The item removed from the queue.
**Return type:**
PullItemType
**Raises:**
[queue.Empty](https://docs.python.org/3/library/queue.html#queue.Empty) – If the queue is empty and no item can be retrieved.
## `MAXPushQueue` {#max.interfaces.MAXPushQueue}
> class max.interfaces.MAXPushQueue(\*args, \*\*kwargs)
Protocol for a minimal, non-blocking push queue interface in MAX.
This protocol defines the contract for a queue that supports non-blocking
put operations for adding items. It is generic over the item type and designed
for scenarios where the caller must be immediately notified of success or failure
rather than waiting for space to become available.
The protocol is intended for producer-side queue operations where immediate
feedback is critical for proper flow control and error handling.
### `put_nowait()` {#max.interfaces.MAXPushQueue.put_nowait}
> put\_nowait(item)
Attempt to put an item into the queue without blocking.
This method is designed to immediately fail (typically by raising an exception)
if the item cannot be added to the queue at the time of the call. Unlike the
traditional ‘put’ method in many queue implementations—which may block until
space becomes available or the transfer is completed—this method never waits.
It is intended for use cases where the caller must be notified of failure to
enqueue immediately, rather than waiting for space.
**Parameters:**
item (PushItemType) – The item to be added to the queue.
**Return type:**
None
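Python's standard `queue.Queue` already exposes `put_nowait()`/`get_nowait()`, so it structurally satisfies both protocols; a short sketch of the non-blocking contract (MAX may supply its own implementations):

```python
import queue

# queue.Queue structurally satisfies MAXPushQueue and MAXPullQueue:
# both operations fail immediately instead of waiting.
q: "queue.Queue[str]" = queue.Queue(maxsize=2)

q.put_nowait("a")
q.put_nowait("b")
try:
    q.put_nowait("c")      # queue is full: fail immediately, never block
    overflowed = False
except queue.Full:
    overflowed = True

drained = []
while True:
    try:
        drained.append(q.get_nowait())
    except queue.Empty:    # immediate feedback instead of waiting
        break

assert overflowed
assert drained == ["a", "b"]
```

The immediate `queue.Full`/`queue.Empty` feedback is what enables the flow control both protocol docstrings describe.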
## `OpenResponsesRequest` {#max.interfaces.OpenResponsesRequest}
> class max.interfaces.OpenResponsesRequest(request\_id, body)
General request container for OpenResponses API requests.
This class wraps an OpenResponsesRequestBody and adheres to the Request schema.
All request fields are accessed directly from the body.
**Parameters:**
* request\_id ([RequestID](#max.interfaces.RequestID))
* body (OpenResponsesRequestBody)
### `body` {#max.interfaces.OpenResponsesRequest.body}
> body: OpenResponsesRequestBody
The complete OpenResponses request body.
### `from_fastapi_request()` {#max.interfaces.OpenResponsesRequest.from_fastapi_request}
> async classmethod from\_fastapi\_request(request)
Create an OpenResponsesRequest from a FastAPI/Starlette Request.
Extracts the request\_id from request.state.request\_id and parses the
request body as an OpenResponsesRequestBody.
**Parameters:**
request (FastAPIRequestProtocol) – A request object with state.request\_id and body() method.
Compatible with FastAPI/Starlette Request objects.
**Returns:**
An OpenResponsesRequest instance.
**Raises:**
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If request.state.request\_id is not set.
* pydantic.ValidationError – If the request body is invalid.
## `Pipeline` {#max.interfaces.Pipeline}
> class max.interfaces.Pipeline
Abstract base class for pipeline operations.
This generic abstract class defines the interface for pipeline operations that
transform inputs of type PipelineInputsType into outputs of type PipelineOutputsDict\[PipelineOutputType].
All concrete pipeline implementations must inherit from this class and implement
the execute method.
Type Parameters:
* PipelineInputsType: The type of inputs this pipeline accepts; must inherit from PipelineInputs.
* PipelineOutputType: The type of outputs this pipeline produces; must be a subclass of PipelineOutput.
```python
class MyPipeline(Pipeline[MyInputs, MyOutput]):
    def execute(self, inputs: MyInputs) -> dict[RequestID, MyOutput]:
        # Implementation here
        pass
```
### `execute()` {#max.interfaces.Pipeline.execute}
> abstract execute(inputs)
Execute the pipeline operation with the given inputs.
This method must be implemented by all concrete pipeline classes.
It takes inputs of the specified type and returns outputs according
to the pipeline’s processing logic.
**Parameters:**
inputs (PipelineInputsType) – The input data for the pipeline operation, must be of type PipelineInputsType
**Returns:**
The results of the pipeline operation, as a dictionary mapping RequestID to PipelineOutputType
**Raises:**
[NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) – If not implemented by a concrete subclass
### `release()` {#max.interfaces.Pipeline.release}
> abstract release(request\_id)
Release any resources or state associated with a specific request.
This method should be implemented by concrete pipeline classes to perform
cleanup or resource deallocation for the given request ID. It is typically
called when a request has completed processing and its associated resources
(such as memory, cache, or temporary files) are no longer needed.
**Parameters:**
request\_id ([RequestID](#max.interfaces.RequestID)) – The unique identifier of the request to release resources for.
**Returns:**
None
**Raises:**
[NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) – If not implemented by a concrete subclass.
**Return type:**
None
## `PipelineInputs` {#max.interfaces.PipelineInputs}
> class max.interfaces.PipelineInputs
Base class representing inputs to a pipeline operation.
This class serves as a marker interface for all pipeline input types.
Concrete implementations should inherit from this class and define
the specific input data structures required for their pipeline operations.
```python
class MyPipelineInputs(PipelineInputs):
    def __init__(self, data: str, config: dict):
        self.data = data
        self.config = config
```
## `PipelineOutput` {#max.interfaces.PipelineOutput}
> class max.interfaces.PipelineOutput(\*args, \*\*kwargs)
Protocol representing the output of a pipeline operation.
Subclasses must implement the is\_done property to indicate whether
the pipeline operation has completed.
### `is_done` {#max.interfaces.PipelineOutput.is_done}
> property is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
Indicates whether the pipeline operation has completed.
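The `Pipeline`/`PipelineInputs`/`PipelineOutput` contract can be sketched end to end with plain-Python stand-ins (hypothetical `Echo*` classes, not the MAX types):

```python
from dataclasses import dataclass

@dataclass
class EchoInputs:                 # plays the role of PipelineInputs
    batch: dict[str, str]         # request_id -> prompt

@dataclass
class EchoOutput:                 # plays the role of PipelineOutput
    text: str

    @property
    def is_done(self) -> bool:    # this toy pipeline finishes in one step
        return True

class EchoPipeline:               # plays the role of Pipeline
    def execute(self, inputs: EchoInputs) -> dict[str, EchoOutput]:
        # One output per request ID, mirroring the Pipeline.execute contract.
        return {rid: EchoOutput(text=p.upper()) for rid, p in inputs.batch.items()}

    def release(self, request_id: str) -> None:
        pass                      # nothing to clean up in this sketch

out = EchoPipeline().execute(EchoInputs(batch={"r1": "hi"}))
assert out["r1"].text == "HI" and out["r1"].is_done
```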
## `PipelineTokenizer` {#max.interfaces.PipelineTokenizer}
> class max.interfaces.PipelineTokenizer(\*args, \*\*kwargs)
Protocol for tokenizers used by pipelines.
### `encode()` {#max.interfaces.PipelineTokenizer.encode}
> async encode(prompt)
Encodes a text prompt as tokens.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the prompt exceeds the configured maximum length.
**Return type:**
TokenizerEncoded
### `eos` {#max.interfaces.PipelineTokenizer.eos}
> property eos: [int](https://docs.python.org/3/library/functions.html#int)
The end of sequence token for this tokenizer.
### `expects_content_wrapping` {#max.interfaces.PipelineTokenizer.expects_content_wrapping}
> property expects\_content\_wrapping: [bool](https://docs.python.org/3/library/functions.html#bool)
If true, this tokenizer expects messages to be wrapped as a dict.
Text messages are formatted as:
```json
{
"role": "user",
"content": [{ "type": "text", "text": "text content" }]
}
```
instead of:
```json
{ "role": "user", "content": "text_content" }
```
NOTE: Multimodal messages omit the content property.
Both `image_urls` and `image` content parts are converted to:
```json
{ "type": "image" }
```
Their content is provided as byte arrays through the top-level property
on the request object, i.e., `RequestType.images`.
### `new_context()` {#max.interfaces.PipelineTokenizer.new_context}
> async new\_context(request)
Creates a new context from a request object. This is sent to the
worker process once and then cached locally.
**Parameters:**
request (RequestType) – Incoming request.
**Returns:**
Initialized context.
**Return type:**
UnboundContextType
## `PipelinesFactory` {#max.interfaces.PipelinesFactory}
> max.interfaces.PipelinesFactory
Type alias for factory functions that create pipeline instances.
Factory functions should return a Pipeline with properly typed inputs and outputs
that are bound to the PipelineInputs and PipelineOutput base classes respectively.
This ensures type safety while maintaining flexibility for different pipeline implementations.
**Example:**
```python
def create_text_pipeline() -> Pipeline[TextGenerationInputs, TextGenerationOutput]:
    return MyTextGenerationPipeline()

factory: PipelinesFactory = create_text_pipeline
```
alias of [`Callable`](graph/ops.md#max.graph.ops.Callable)\[\[], [`Pipeline`](#max.interfaces.Pipeline)\[`PipelineInputsType`, `PipelineOutputType`]]
## `PixelGenerationContext` {#max.interfaces.PixelGenerationContext}
> class max.interfaces.PixelGenerationContext(\*args, \*\*kwargs)
Protocol defining the interface for pixel generation contexts.
A `PixelGenerationContext` represents model inputs for pixel generation pipelines,
managing the state and parameters needed for generating images or videos.
### `guidance_scale` {#max.interfaces.PixelGenerationContext.guidance_scale}
> property guidance\_scale: [float](https://docs.python.org/3/library/functions.html#float)
Classifier-free guidance scale (1.0 to disable CFG).
### `height` {#max.interfaces.PixelGenerationContext.height}
> property height: [int](https://docs.python.org/3/library/functions.html#int)
Height of generated output in pixels.
### `latents` {#max.interfaces.PixelGenerationContext.latents}
> property latents: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]]
The latents for the context.
### `num_images_per_prompt` {#max.interfaces.PixelGenerationContext.num_images_per_prompt}
> property num\_images\_per\_prompt: [int](https://docs.python.org/3/library/functions.html#int)
Number of images to generate.
### `num_inference_steps` {#max.interfaces.PixelGenerationContext.num_inference_steps}
> property num\_inference\_steps: [int](https://docs.python.org/3/library/functions.html#int)
Number of denoising steps.
### `tokens` {#max.interfaces.PixelGenerationContext.tokens}
> property tokens: [TokenBuffer](#max.interfaces.TokenBuffer)
The token buffer for the context.
### `width` {#max.interfaces.PixelGenerationContext.width}
> property width: [int](https://docs.python.org/3/library/functions.html#int)
Width of generated output in pixels.
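As a point of reference, `guidance_scale` is typically applied by blending unconditional and conditional predictions at each denoising step; a minimal sketch of that math (illustrative, not MAX's implementation):

```python
# Classifier-free guidance (CFG) blends two model predictions per step.
# A scale of 1.0 reduces to the conditional prediction, i.e. CFG disabled,
# matching the guidance_scale docstring above.
def apply_cfg(uncond: float, cond: float, guidance_scale: float) -> float:
    return uncond + guidance_scale * (cond - uncond)

assert apply_cfg(0.2, 0.8, 1.0) == 0.8   # scale 1.0: pure conditional
assert apply_cfg(0.2, 0.8, 7.5) > 0.8    # scale > 1 pushes past conditional
```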
## `PixelGenerationInputs` {#max.interfaces.PixelGenerationInputs}
> class max.interfaces.PixelGenerationInputs(batch)
Input data structure for pixel generation pipelines.
This class represents the input data required for pixel generation operations
within the pipeline framework. It extends PipelineInputs and provides type-safe
generic support for different pixel generation context types.
### `batch` {#max.interfaces.PixelGenerationInputs.batch}
> batch: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[RequestID](#max.interfaces.RequestID), PixelGenerationContextType]
A dictionary mapping RequestID to PixelGenerationContextType instances.
This batch structure allows for processing multiple pixel generation
requests simultaneously while maintaining request-specific context
and configuration data.
## `PixelGenerationOutput` {#max.interfaces.PixelGenerationOutput}
> class max.interfaces.PixelGenerationOutput(request\_id, final\_status, pixel\_data=&lt;factory&gt;)
Represents a response from the pixel generation API.
This class encapsulates the result of a pixel generation request, including
the request ID, final status, and generated pixel data.
### `final_status` {#max.interfaces.PixelGenerationOutput.final_status}
> final\_status: [GenerationStatus](#max.interfaces.GenerationStatus)
The final status of the generation process.
### `is_done` {#max.interfaces.PixelGenerationOutput.is_done}
> property is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
Indicates whether the pixel generation process is complete.
## `ProcessorInputs` {#max.interfaces.ProcessorInputs}
> class max.interfaces.ProcessorInputs(context, logits)
Inputs provided to a logits processor.
### `context` {#max.interfaces.ProcessorInputs.context}
> context: [TextGenerationContext](#max.interfaces.TextGenerationContext)
### `logits` {#max.interfaces.ProcessorInputs.logits}
> logits: md.Buffer
## `Request` {#max.interfaces.Request}
> class max.interfaces.Request(request\_id)
Base class representing a generic request within the MAX API.
This class provides a unique identifier for each request, ensuring that
all requests can be tracked and referenced consistently throughout the
system. Subclasses can extend this class to include additional fields
specific to their request types.
### `request_id` {#max.interfaces.Request.request_id}
> request\_id: [RequestID](#max.interfaces.RequestID)
## `RequestID` {#max.interfaces.RequestID}
> class max.interfaces.RequestID(value=&lt;factory&gt;)
A unique immutable identifier for a request.
When instantiated without arguments, automatically generates a new
UUID4-based ID.
**Parameters:**
value ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The string identifier. If not provided, generates a UUID4 hex string.
### `value` {#max.interfaces.RequestID.value}
> value: [str](https://docs.python.org/3/library/stdtypes.html#str)
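The auto-generation behavior can be sketched with a frozen dataclass stand-in (hypothetical `MiniRequestID`, not the MAX class):

```python
import dataclasses
import uuid

# Minimal stand-in for RequestID: a frozen (immutable) wrapper that
# auto-generates a UUID4 hex string when no value is provided.
@dataclasses.dataclass(frozen=True)
class MiniRequestID:
    value: str = dataclasses.field(default_factory=lambda: uuid.uuid4().hex)

a, b = MiniRequestID(), MiniRequestID()
assert a.value != b.value                       # generated IDs are unique
assert MiniRequestID("req-1").value == "req-1"  # explicit values pass through
```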
## `SamplingParams` {#max.interfaces.SamplingParams}
> class max.interfaces.SamplingParams(top\_k=-1, top\_p=1, min\_p=0.0, temperature=1, frequency\_penalty=0.0, presence\_penalty=0.0, repetition\_penalty=1.0, max\_new\_tokens=None, min\_new\_tokens=0, ignore\_eos=False, stop=None, stop\_token\_ids=None, detokenize=True, seed=&lt;factory&gt;, logits\_processors=None)
Request-specific sampling parameters that are only known at run time.
### `detokenize` {#max.interfaces.SamplingParams.detokenize}
> detokenize: [bool](https://docs.python.org/3/library/functions.html#bool) = True
Whether to detokenize the output tokens into text.
### `frequency_penalty` {#max.interfaces.SamplingParams.frequency_penalty}
> frequency\_penalty: [float](https://docs.python.org/3/library/functions.html#float) = 0.0
The frequency penalty to apply to the model’s output. A positive value will penalize new tokens
based on their frequency in the generated text: tokens will receive a penalty proportional to the
count of appearances.
### `from_input_and_generation_config()` {#max.interfaces.SamplingParams.from_input_and_generation_config}
> classmethod from\_input\_and\_generation\_config(input\_params, sampling\_params\_defaults)
Create SamplingParams with defaults from HuggingFace’s GenerationConfig.
This method creates a SamplingParams instance by combining three sources of values,
in priority order (highest to lowest):
1. User-provided values in input\_params (non-None)
2. Model’s GenerationConfig values (only if explicitly set in the model’s config)
3. SamplingParams class defaults
**Parameters:**
* input\_params ([SamplingParamsInput](#max.interfaces.SamplingParamsInput)) – Dataclass containing user-specified parameter values.
Values of None will be replaced with model defaults or class defaults.
* sampling\_params\_defaults ([SamplingParamsGenerationConfigDefaults](#max.interfaces.SamplingParamsGenerationConfigDefaults)) – SamplingParamsGenerationConfigDefaults containing
default sampling parameters extracted from the model’s GenerationConfig.
**Returns:**
A new SamplingParams instance with model-aware defaults.
**Return type:**
[SamplingParams](#max.interfaces.SamplingParams)
**Example:**
```pycon
>>> sampling_defaults = model_config.sampling_params_defaults
>>> params = SamplingParams.from_input_and_generation_config(
... SamplingParamsInput(temperature=0.7), # User override
... sampling_params_defaults=sampling_defaults
... )
```
### `ignore_eos` {#max.interfaces.SamplingParams.ignore_eos}
> ignore\_eos: [bool](https://docs.python.org/3/library/functions.html#bool) = False
If True, the response will ignore the EOS token, and continue to
generate until the max tokens or a stop string is hit.
### `log_sampling_info()` {#max.interfaces.SamplingParams.log_sampling_info}
> log\_sampling\_info()
Log comprehensive sampling parameters information.
Displays all sampling parameters in a consistent visual format similar to
pipeline configuration logging.
**Return type:**
None
### `logits_processors` {#max.interfaces.SamplingParams.logits_processors}
> logits\_processors: [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Callable](graph/ops.md#max.graph.ops.Callable)\[\[[ProcessorInputs](#max.interfaces.ProcessorInputs)], [None](https://docs.python.org/3/library/constants.html#None)]] | [None](https://docs.python.org/3/library/constants.html#None) = None
Callables to post-process the model logits.
See `LogitsProcessor` for examples.
### `max_new_tokens` {#max.interfaces.SamplingParams.max_new_tokens}
> max\_new\_tokens: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
The maximum number of new tokens to generate in the response.
When set to an integer value, generation will stop after this many tokens.
When None (default), the model may generate tokens until it reaches its
internal limits or other stopping criteria are met.
### `min_new_tokens` {#max.interfaces.SamplingParams.min_new_tokens}
> min\_new\_tokens: [int](https://docs.python.org/3/library/functions.html#int) = 0
The minimum number of tokens to generate in the response.
### `min_p` {#max.interfaces.SamplingParams.min_p}
> min\_p: [float](https://docs.python.org/3/library/functions.html#float) = 0.0
Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in \[0, 1]. Set to 0 to disable this.
### `presence_penalty` {#max.interfaces.SamplingParams.presence_penalty}
> presence\_penalty: [float](https://docs.python.org/3/library/functions.html#float) = 0.0
The presence penalty to apply to the model’s output. A positive value will penalize new tokens
that have already appeared in the generated text at least once by applying a constant penalty.
### `repetition_penalty` {#max.interfaces.SamplingParams.repetition_penalty}
> repetition\_penalty: [float](https://docs.python.org/3/library/functions.html#float) = 1.0
The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens
that have already appeared in the generated text at least once by dividing the logits by the
repetition penalty.
### `seed` {#max.interfaces.SamplingParams.seed}
> seed: [int](https://docs.python.org/3/library/functions.html#int)
The seed to use for the random number generator. Defaults to a cryptographically secure random value.
### `stop` {#max.interfaces.SamplingParams.stop}
> stop: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)] | [None](https://docs.python.org/3/library/constants.html#None) = None
A list of detokenized sequences that can be used as stop criteria when generating a new sequence.
### `stop_token_ids` {#max.interfaces.SamplingParams.stop_token_ids}
> stop\_token\_ids: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)] | [None](https://docs.python.org/3/library/constants.html#None) = None
A list of token ids that are used as stopping criteria when generating a new sequence.
### `temperature` {#max.interfaces.SamplingParams.temperature}
> temperature: [float](https://docs.python.org/3/library/functions.html#float) = 1
Controls the randomness of the model’s output; higher values produce more diverse responses.
For greedy sampling, set the temperature to 0.
### `top_k` {#max.interfaces.SamplingParams.top_k}
> top\_k: [int](https://docs.python.org/3/library/functions.html#int) = -1
Limits the sampling to the K most probable tokens. Defaults to -1 (sample over all tokens); for greedy sampling, set to 1.
### `top_p` {#max.interfaces.SamplingParams.top_p}
> top\_p: [float](https://docs.python.org/3/library/functions.html#float) = 1
Only use the tokens whose cumulative probability is within the top\_p threshold. This applies to the top\_k tokens.
## `SamplingParamsGenerationConfigDefaults` {#max.interfaces.SamplingParamsGenerationConfigDefaults}
> class max.interfaces.SamplingParamsGenerationConfigDefaults(temperature=None, top\_p=None, top\_k=None, repetition\_penalty=None, max\_new\_tokens=None, min\_new\_tokens=None, do\_sample=None)
Default sampling parameter values extracted from a model’s GenerationConfig.
This class encapsulates sampling parameter defaults that come from a HuggingFace
model’s GenerationConfig. These defaults have middle priority when creating
SamplingParams instances:
Priority order (highest to lowest):
1. User-provided values (SamplingParamsInput)
2. Model’s GenerationConfig values (this class)
3. SamplingParams class defaults
All fields default to None, indicating that the model’s GenerationConfig does not
explicitly set that parameter. When None, SamplingParams will fall back to its
own class defaults.
**Example:**
```pycon
>>> # Extract from model config
>>> gen_config = model_config.generation_config
>>> defaults = SamplingParamsGenerationConfigDefaults(
... temperature=0.7,
... top_k=50,
... max_new_tokens=512
... )
>>> # Use with SamplingParams
>>> params = SamplingParams.from_input_and_generation_config(
... SamplingParamsInput(),
... sampling_params_defaults=defaults
... )
```
### `do_sample` {#max.interfaces.SamplingParamsGenerationConfigDefaults.do_sample}
> do\_sample: [bool](https://docs.python.org/3/library/functions.html#bool) | [None](https://docs.python.org/3/library/constants.html#None) = None
If False, use greedy sampling.
### `max_new_tokens` {#max.interfaces.SamplingParamsGenerationConfigDefaults.max_new_tokens}
> max\_new\_tokens: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
Maximum number of new tokens from the model’s GenerationConfig, if explicitly set.
### `min_new_tokens` {#max.interfaces.SamplingParamsGenerationConfigDefaults.min_new_tokens}
> min\_new\_tokens: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
Minimum number of new tokens from the model’s GenerationConfig, if explicitly set.
### `repetition_penalty` {#max.interfaces.SamplingParamsGenerationConfigDefaults.repetition_penalty}
> repetition\_penalty: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
Repetition penalty value from the model’s GenerationConfig, if explicitly set.
### `temperature` {#max.interfaces.SamplingParamsGenerationConfigDefaults.temperature}
> temperature: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
Temperature value from the model’s GenerationConfig, if explicitly set.
### `top_k` {#max.interfaces.SamplingParamsGenerationConfigDefaults.top_k}
> top\_k: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
Top-k sampling value from the model’s GenerationConfig, if explicitly set.
### `top_p` {#max.interfaces.SamplingParamsGenerationConfigDefaults.top_p}
> top\_p: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
Top-p (nucleus sampling) value from the model’s GenerationConfig, if explicitly set.
### `values_to_update` {#max.interfaces.SamplingParamsGenerationConfigDefaults.values_to_update}
> property values\_to\_update: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [float](https://docs.python.org/3/library/functions.html#float) | [int](https://docs.python.org/3/library/functions.html#int)]
Extract non-None field values as a dictionary.
**Returns:**
A dictionary mapping field names to their values, excluding any fields
that are None. This dictionary can be used to update SamplingParams
default values.
**Example:**
```pycon
>>> defaults = SamplingParamsGenerationConfigDefaults(
... temperature=0.7,
... top_k=50
... )
>>> defaults.values_to_update
{'temperature': 0.7, 'top_k': 50}
```
## `SamplingParamsInput` {#max.interfaces.SamplingParamsInput}
> class max.interfaces.SamplingParamsInput(top\_k=None, top\_p=None, min\_p=None, temperature=None, frequency\_penalty=None, presence\_penalty=None, repetition\_penalty=None, max\_new\_tokens=None, min\_new\_tokens=None, ignore\_eos=None, stop=None, stop\_token\_ids=None, detokenize=None, seed=None, logits\_processors=None)
Input dataclass for creating SamplingParams instances.
All fields are optional, allowing partial specification with None values
indicating “use default”. This enables static type checking while maintaining
the flexibility to specify only the parameters you want to override.
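The three-level priority that `from_input_and_generation_config()` applies to these fields can be sketched as a plain dictionary merge (hypothetical helper, not the actual implementation):

```python
# Sketch of the merge: user input beats the model's GenerationConfig
# defaults, which beat the SamplingParams class defaults. None means
# "not specified" at both of the higher-priority levels.
CLASS_DEFAULTS = {"temperature": 1.0, "top_k": -1, "top_p": 1.0}

def resolve(user: dict, gen_config: dict) -> dict:
    merged = dict(CLASS_DEFAULTS)
    merged.update({k: v for k, v in gen_config.items() if v is not None})
    merged.update({k: v for k, v in user.items() if v is not None})
    return merged

params = resolve({"temperature": 0.7}, {"temperature": 0.2, "top_k": 50})
assert params == {"temperature": 0.7, "top_k": 50, "top_p": 1.0}
```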
### `detokenize` {#max.interfaces.SamplingParamsInput.detokenize}
> detokenize: [bool](https://docs.python.org/3/library/functions.html#bool) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `frequency_penalty` {#max.interfaces.SamplingParamsInput.frequency_penalty}
> frequency\_penalty: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `ignore_eos` {#max.interfaces.SamplingParamsInput.ignore_eos}
> ignore\_eos: [bool](https://docs.python.org/3/library/functions.html#bool) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `logits_processors` {#max.interfaces.SamplingParamsInput.logits_processors}
> logits\_processors: [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Callable](graph/ops.md#max.graph.ops.Callable)\[\[[ProcessorInputs](#max.interfaces.ProcessorInputs)], [None](https://docs.python.org/3/library/constants.html#None)]] | [None](https://docs.python.org/3/library/constants.html#None) = None
### `max_new_tokens` {#max.interfaces.SamplingParamsInput.max_new_tokens}
> max\_new\_tokens: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `min_new_tokens` {#max.interfaces.SamplingParamsInput.min_new_tokens}
> min\_new\_tokens: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `min_p` {#max.interfaces.SamplingParamsInput.min_p}
> min\_p: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `presence_penalty` {#max.interfaces.SamplingParamsInput.presence_penalty}
> presence\_penalty: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `repetition_penalty` {#max.interfaces.SamplingParamsInput.repetition_penalty}
> repetition\_penalty: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `seed` {#max.interfaces.SamplingParamsInput.seed}
> seed: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `stop` {#max.interfaces.SamplingParamsInput.stop}
> stop: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)] | [None](https://docs.python.org/3/library/constants.html#None) = None
### `stop_token_ids` {#max.interfaces.SamplingParamsInput.stop_token_ids}
> stop\_token\_ids: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)] | [None](https://docs.python.org/3/library/constants.html#None) = None
### `temperature` {#max.interfaces.SamplingParamsInput.temperature}
> temperature: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `top_k` {#max.interfaces.SamplingParamsInput.top_k}
> top\_k: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `top_p` {#max.interfaces.SamplingParamsInput.top_p}
> top\_p: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
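Since every field defaults to `None`, a request typically supplies only the overrides it needs. A minimal sketch using plain dicts and hypothetical values (the real class is a typed input object, not a dict):

```python
# Hypothetical sampling overrides using the documented field names.
# Any field left unset (None) falls through to the pipeline's defaults.
overrides = {
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.95,
    "max_new_tokens": 256,
    "stop": ["\n\n"],
}

# Keep only explicitly supplied values (assumed merge behavior).
effective = {k: v for k, v in overrides.items() if v is not None}
```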
## `Scheduler` {#max.interfaces.Scheduler}
> class max.interfaces.Scheduler
Abstract base class defining the interface for schedulers.
### `run_iteration()` {#max.interfaces.Scheduler.run_iteration}
> abstract run\_iteration()
The core scheduler routine that creates and executes batches.
This method should implement the core scheduling logic including:
* Batch creation and management
* Request scheduling
## `SchedulerResult` {#max.interfaces.SchedulerResult}
> class max.interfaces.SchedulerResult(is\_done, result)
Structure representing the result of a scheduler operation for a specific pipeline execution.
This class encapsulates the outcome of a pipeline operation as managed by the scheduler,
including both the execution status and any resulting data from the pipeline. The scheduler
uses this structure to communicate the state of pipeline operations back to clients,
whether the operation is still running, has completed successfully, or was cancelled.
The generic type parameter allows this result to work with different types of pipeline
outputs while maintaining type safety.
**Parameters:**
* is\_done ([bool](https://docs.python.org/3/library/functions.html#bool))
* result (PipelineOutputType | None)
### `cancelled()` {#max.interfaces.SchedulerResult.cancelled}
> classmethod cancelled()
Create a SchedulerResult representing a cancelled pipeline operation.
### `create()` {#max.interfaces.SchedulerResult.create}
> classmethod create(result)
Create a SchedulerResult representing a pipeline operation with some result.
**Parameters:**
result (PipelineOutputType) – The pipeline output data.
### `is_done` {#max.interfaces.SchedulerResult.is_done}
> is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
The current status of the pipeline operation from the scheduler’s perspective.
### `result` {#max.interfaces.SchedulerResult.result}
> result: PipelineOutputType | [None](https://docs.python.org/3/library/constants.html#None)
The pipeline output data, if any. May be None for cancelled operations or during intermediate states of streaming operations.
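The lifecycle above can be sketched with a generic dataclass. `SchedulerResultSketch` is an illustration only; in particular, the `is_done` values chosen in `cancelled()` and `create()` are assumptions inferred from the field descriptions, not the real implementation:

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

PipelineOutputType = TypeVar("PipelineOutputType")


@dataclass(frozen=True)
class SchedulerResultSketch(Generic[PipelineOutputType]):
    """Hypothetical stand-in mirroring the documented SchedulerResult fields."""

    is_done: bool
    result: Optional[PipelineOutputType]

    @classmethod
    def cancelled(cls) -> "SchedulerResultSketch[PipelineOutputType]":
        # Assumed: a cancelled operation is terminal and carries no result.
        return cls(is_done=True, result=None)

    @classmethod
    def create(cls, result: PipelineOutputType) -> "SchedulerResultSketch[PipelineOutputType]":
        # Assumed: a streaming intermediate result is not yet done.
        return cls(is_done=False, result=result)
```

The generic parameter lets the same wrapper carry any pipeline output type, e.g. `SchedulerResultSketch.create("partial text")` for a text pipeline.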
## `SharedMemoryArray` {#max.interfaces.SharedMemoryArray}
> class max.interfaces.SharedMemoryArray(name, shape, dtype)
Wrapper for numpy array stored in shared memory.
This class is used as a placeholder in pixel\_values during serialization.
It will be encoded as a dict with \_\_shm\_\_ flag and decoded back to a numpy
array.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str))
* shape ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), ...])
* dtype ([str](https://docs.python.org/3/library/stdtypes.html#str))
## `TextContentPart` {#max.interfaces.TextContentPart}
> class max.interfaces.TextContentPart(\*, type='text', text)
**Parameters:**
* type ([Literal](https://docs.python.org/3/library/typing.html#typing.Literal)\['text'])
* text ([str](https://docs.python.org/3/library/stdtypes.html#str))
### `model_config` {#max.interfaces.TextContentPart.model_config}
> model\_config: ClassVar\[ConfigDict] = {'frozen': True}
Configuration for the model, should be a dictionary conforming to \[ConfigDict]\[pydantic.config.ConfigDict].
### `text` {#max.interfaces.TextContentPart.text}
> text: [str](https://docs.python.org/3/library/stdtypes.html#str)
### `type` {#max.interfaces.TextContentPart.type}
> type: Literal\['text']
## `TextGenerationContext` {#max.interfaces.TextGenerationContext}
> class max.interfaces.TextGenerationContext(\*args, \*\*kwargs)
Protocol defining the interface for text generation contexts in token generation.
A `TextGenerationContext` represents model inputs for text generation pipelines, managing
the state of tokens throughout the generation process. It handles token arrays,
generation status, sampling parameters, and various indices that track different
stages of token processing.
### `compute_num_available_steps()` {#max.interfaces.TextGenerationContext.compute_num_available_steps}
> compute\_num\_available\_steps(max\_seq\_len)
Compute the maximum number of generation steps available.
This method calculates how many tokens can be generated without
exceeding the specified maximum sequence length limit.
**Parameters:**
max\_seq\_len ([int](https://docs.python.org/3/library/functions.html#int)) – The maximum allowed sequence length for this context.
**Returns:**
The number of generation steps that can be executed before reaching
the sequence length limit.
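The calculation described here reduces to arithmetic on the current sequence length. A hypothetical free-function version (the real method reads the current length from the context's token buffer):

```python
def compute_num_available_steps(current_length: int, max_seq_len: int) -> int:
    """Steps remaining before the sequence would exceed max_seq_len.

    Hypothetical sketch of the documented method; clamps at zero when the
    context is already at or past the limit.
    """
    return max(0, max_seq_len - current_length)
```

For example, a context already holding 1000 tokens under a 1024-token limit has 24 steps available.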
### `eos_token_ids` {#max.interfaces.TextGenerationContext.eos_token_ids}
> property eos\_token\_ids: [set](https://docs.python.org/3/library/stdtypes.html#set)\[[int](https://docs.python.org/3/library/functions.html#int)]
The set of end-of-sequence token IDs that can terminate generation.
**Returns:**
A set of token IDs that, when generated, will signal the end of the
sequence and terminate the generation process.
### `get_min_token_logit_mask()` {#max.interfaces.TextGenerationContext.get_min_token_logit_mask}
> get\_min\_token\_logit\_mask(num\_steps)
Get token indices that should be masked in the output logits.
This method is primarily used to implement the `min_tokens` constraint,
where certain tokens (typically EOS tokens) are masked to prevent early
termination before the minimum token count is reached.
**Parameters:**
num\_steps ([int](https://docs.python.org/3/library/functions.html#int)) – The number of generation steps to compute masks for.
**Returns:**
A list of NumPy arrays, where each array contains token indices
that should be masked (set to negative infinity) in the logits
for the corresponding generation step.
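A sketch of how such per-step masks might be built for the `min_tokens` constraint; the logic below is an assumption for illustration, not the library's actual implementation:

```python
import numpy as np


def min_token_logit_masks(
    generated_so_far: int,
    min_tokens: int,
    eos_token_ids: set[int],
    num_steps: int,
) -> list[np.ndarray]:
    """One index array per step: EOS ids to mask while under min_tokens."""
    masks = []
    for step in range(num_steps):
        if generated_so_far + step < min_tokens:
            # Still below the minimum: suppress every EOS token this step.
            masks.append(np.array(sorted(eos_token_ids), dtype=np.int64))
        else:
            # Minimum reached: no tokens need masking.
            masks.append(np.array([], dtype=np.int64))
    return masks
```

A caller would then apply each array to that step's logits, e.g. `logits[mask] = -np.inf`, so EOS tokens cannot be sampled early.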
### `is_initial_prompt` {#max.interfaces.TextGenerationContext.is_initial_prompt}
> property is\_initial\_prompt: [bool](https://docs.python.org/3/library/functions.html#bool)
Whether this context contains only the initial prompt.
This property indicates if the context has not yet been updated with
any generated tokens and still contains only the original input.
**Returns:**
`True` if no tokens have been generated yet, `False` if generation
has begun and tokens have been added.
### `json_schema` {#max.interfaces.TextGenerationContext.json_schema}
> property json\_schema: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None)
The JSON schema for constrained decoding, if configured.
When set, this schema constrains token generation to produce valid JSON
output that conforms to the specified structure.
**Returns:**
The JSON schema string, or `None` if no schema constraint is active.
### `jump_ahead()` {#max.interfaces.TextGenerationContext.jump_ahead}
> jump\_ahead(new\_token)
Jump ahead in generation by adding a token and updating indices.
This method is used in speculative decoding scenarios to quickly
advance the generation state when draft tokens are accepted.
**Parameters:**
new\_token ([int](https://docs.python.org/3/library/functions.html#int)) – The token ID to add when jumping ahead in the sequence.
**Return type:**
None
### `log_probabilities` {#max.interfaces.TextGenerationContext.log_probabilities}
> property log\_probabilities: [int](https://docs.python.org/3/library/functions.html#int)
The number of top tokens to return log probabilities for.
When greater than 0, the system returns log probabilities for the top N
most likely tokens at each generation step.
**Returns:**
The number of top tokens to include in log probability output.
Returns 0 if log probabilities are disabled.
### `log_probabilities_echo` {#max.interfaces.TextGenerationContext.log_probabilities_echo}
> property log\_probabilities\_echo: [bool](https://docs.python.org/3/library/functions.html#bool)
Whether to include input tokens in the returned log probabilities.
When `True`, log probabilities will be computed and returned for input
(prompt) tokens in addition to generated tokens.
**Returns:**
`True` if input tokens should be included in log probability output,
`False` otherwise.
### `matcher` {#max.interfaces.TextGenerationContext.matcher}
> property matcher: [Any](https://docs.python.org/3/library/typing.html#typing.Any) | [None](https://docs.python.org/3/library/constants.html#None)
The grammar matcher for structured output generation, if configured.
The matcher enforces structural constraints (like JSON schema) during
generation to ensure valid formatted output.
**Returns:**
The grammar matcher instance, or None if no structured generation
is configured for this context.
:::note Note
The matcher type depends on the structured generation backend used
(e.g., outlines, guidance, etc.). In the future, this should be
replaced with a Protocol for better type safety.
:::
### `max_length` {#max.interfaces.TextGenerationContext.max_length}
> property max\_length: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
The maximum allowed length for this sequence.
When set, generation will stop when this length is reached, regardless
of other stopping criteria.
**Returns:**
The maximum sequence length limit, or `None` if no limit is set.
### `min_tokens` {#max.interfaces.TextGenerationContext.min_tokens}
> property min\_tokens: [int](https://docs.python.org/3/library/functions.html#int)
The minimum number of new tokens that must be generated.
Generation will continue until at least this many new tokens have been
produced, even if other stopping criteria are met (e.g., EOS tokens).
**Returns:**
The minimum number of new tokens to generate.
### `realize_future_token()` {#max.interfaces.TextGenerationContext.realize_future_token}
> realize\_future\_token(new\_token, log\_probabilities=None)
Overwrite the placeholder future token with the actual token.
This is primarily used for overlap scheduling.
### `reset()` {#max.interfaces.TextGenerationContext.reset}
> reset()
Resets the context’s state by combining all tokens into a new prompt.
This method is used when a request is evicted, meaning that the context
needs to be re-encoded in the following context-encoding (CE) iteration.
**Return type:**
None
### `sampling_params` {#max.interfaces.TextGenerationContext.sampling_params}
> property sampling\_params: [SamplingParams](#max.interfaces.SamplingParams)
The sampling parameters configured for this generation request.
These parameters control how tokens are selected during generation,
including temperature, top-k/top-p filtering, and stopping criteria.
**Returns:**
The `SamplingParams` instance containing all sampling configuration
for this context.
### `set_matcher()` {#max.interfaces.TextGenerationContext.set_matcher}
> set\_matcher(matcher)
Set a grammar matcher for constrained decoding.
This method configures structured output generation by installing a
grammar matcher that enforces format constraints during token generation.
**Parameters:**
matcher ([Any](https://docs.python.org/3/library/typing.html#typing.Any)) – The grammar matcher instance to use for constraining output.
The specific type depends on the structured generation backend.
**Return type:**
None
### `to_generation_output()` {#max.interfaces.TextGenerationContext.to_generation_output}
> to\_generation\_output()
Convert this context to a TextGenerationOutput object.
This property provides a standardized way to extract the final output
of the text generation process from the context, including generated
text, tokens, and any associated metadata.
**Returns:**
The output object containing the results of
the text generation for this context.
### `tokens` {#max.interfaces.TextGenerationContext.tokens}
> property tokens: [TokenBuffer](#max.interfaces.TokenBuffer)
The token buffer for the context.
### `update()` {#max.interfaces.TextGenerationContext.update}
> update(new\_token, log\_probabilities=None)
Update the context with a newly generated token, and update status.
This method adds a generated token to the context, updating the token
array, associated metadata, and log probabilities (if provided).
It is also responsible for updating the context’s generation status and
determining if the generation sequence is complete, either due to reaching
an end-of-sequence condition or meeting stopping criteria.
**Parameters:**
* new\_token ([int](https://docs.python.org/3/library/functions.html#int)) – The token ID to add to the generation sequence.
* log\_probabilities ([LogProbabilities](#max.interfaces.LogProbabilities) | None) – Optional log probability data for the new token
and alternatives. Used for analysis and debugging.
**Return type:**
None
### `update_with_future_token()` {#max.interfaces.TextGenerationContext.update_with_future_token}
> update\_with\_future\_token()
Append a placeholder future token to the generated tokens.
This is primarily used for overlap scheduling.
**Return type:**
None
## `TextGenerationInputs` {#max.interfaces.TextGenerationInputs}
> class max.interfaces.TextGenerationInputs(batches, num\_steps, input\_tokens=-1, batch\_type=BatchType.TG)
Input parameters for text generation pipeline operations.
This class encapsulates the batch of contexts and number of steps required
for token generation in a single input object, replacing the previous
pattern of passing batch and num\_steps as separate parameters.
### `batch_echo` {#max.interfaces.TextGenerationInputs.batch_echo}
> property batch\_echo: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[bool](https://docs.python.org/3/library/functions.html#bool)]
List indicating whether echo is enabled for each context in the batch.
### `batch_top_log_probs` {#max.interfaces.TextGenerationInputs.batch_top_log_probs}
> property batch\_top\_log\_probs: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
List of requested top log probabilities per context in the batch.
### `batch_type` {#max.interfaces.TextGenerationInputs.batch_type}
> batch\_type: [BatchType](#max.interfaces.BatchType) = 'TG'
Type of batch.
### `batches` {#max.interfaces.TextGenerationInputs.batches}
> batches: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[list](https://docs.python.org/3/library/stdtypes.html#list)\[TextGenerationContextType]]
Variable list of batches, with each batch being a list of contexts.
There can be multiple batches when using data parallelism, in which case
each batch is mapped to a different device replica.
### `enable_echo` {#max.interfaces.TextGenerationInputs.enable_echo}
> property enable\_echo: [bool](https://docs.python.org/3/library/functions.html#bool)
Return True if any context in the batch has echo enabled.
### `enable_log_probs` {#max.interfaces.TextGenerationInputs.enable_log_probs}
> property enable\_log\_probs: [bool](https://docs.python.org/3/library/functions.html#bool)
Return True if any context in the batch requests log probabilities.
### `flat_batch` {#max.interfaces.TextGenerationInputs.flat_batch}
> property flat\_batch: [list](https://docs.python.org/3/library/stdtypes.html#list)\[TextGenerationContextType]
Flattened list of contexts across all replicas.
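Flattening the nested per-replica structure is a single comprehension; the strings below are stand-ins for real context objects:

```python
# Hypothetical batches: one inner list per device replica.
batches = [["ctx_a", "ctx_b"], ["ctx_c"]]

# flat_batch concatenates the per-replica batches in order.
flat_batch = [ctx for replica in batches for ctx in replica]
```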
### `input_tokens` {#max.interfaces.TextGenerationInputs.input_tokens}
> input\_tokens: [int](https://docs.python.org/3/library/functions.html#int) = -1
Number of input tokens.
### `num_steps` {#max.interfaces.TextGenerationInputs.num_steps}
> num\_steps: [int](https://docs.python.org/3/library/functions.html#int)
Number of steps to run for.
## `TextGenerationOutput` {#max.interfaces.TextGenerationOutput}
> class max.interfaces.TextGenerationOutput(\*, request\_id, tokens, final\_status, log\_probabilities=None)
Represents the output of a text generation operation, combining token IDs,
final generation status, request ID, and optional log probabilities for each token.
### `final_status` {#max.interfaces.TextGenerationOutput.final_status}
> final\_status: [GenerationStatus](#max.interfaces.GenerationStatus)
The final status of the generation process.
### `is_done` {#max.interfaces.TextGenerationOutput.is_done}
> property is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
Indicates whether the text generation process is complete.
## `TextGenerationRequest` {#max.interfaces.TextGenerationRequest}
### `chat_template_options` {#max.interfaces.TextGenerationRequest.chat_template_options}
> chat\_template\_options: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [None](https://docs.python.org/3/library/constants.html#None) = None
Optional dictionary of options to pass when applying the chat template.
### `echo` {#max.interfaces.TextGenerationRequest.echo}
> echo: [bool](https://docs.python.org/3/library/functions.html#bool) = False
If set to True, the response will include the original prompt along with the
generated output. This can be useful for debugging or when you want to see how
the input relates to the output.
### `images` {#max.interfaces.TextGenerationRequest.images}
> images: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[bytes](https://docs.python.org/3/library/stdtypes.html#bytes)]
A list of image byte arrays that can be included as part of the request.
This field is optional and may be used for multimodal inputs where images
are relevant to the prompt or task.
### `logprobs` {#max.interfaces.TextGenerationRequest.logprobs}
> logprobs: [int](https://docs.python.org/3/library/functions.html#int) = 0
The number of top log probabilities to return for each generated token. A value
of 0 means that log probabilities will not be returned. Useful for analyzing
model confidence in its predictions.
### `messages` {#max.interfaces.TextGenerationRequest.messages}
> messages: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[TextGenerationRequestMessage](#max.interfaces.TextGenerationRequestMessage)]
A list of messages for chat-based interactions. This is used in chat
completion APIs, where each message represents a turn in the conversation.
If provided, the model will generate responses based on these messages.
### `model_name` {#max.interfaces.TextGenerationRequest.model_name}
> model\_name: [str](https://docs.python.org/3/library/stdtypes.html#str)
The name of the model to be used for generating tokens. This should match
the available models on the server and determines the behavior and
capabilities of the response generation.
### `number_of_images` {#max.interfaces.TextGenerationRequest.number_of_images}
> property number\_of\_images: [int](https://docs.python.org/3/library/functions.html#int)
Returns the total number of image-type contents across all provided messages.
**Returns:**
Total count of image-type contents found in messages.
### `prompt` {#max.interfaces.TextGenerationRequest.prompt}
> prompt: [str](https://docs.python.org/3/library/stdtypes.html#str) | [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[int](https://docs.python.org/3/library/functions.html#int)] | [None](https://docs.python.org/3/library/constants.html#None) = None
The prompt to be processed by the model. This field supports legacy
completion APIs and can accept either a string or a sequence of integers
representing token IDs. If not provided, the model may generate output
based on the messages field.
### `request_path` {#max.interfaces.TextGenerationRequest.request_path}
> request\_path: [str](https://docs.python.org/3/library/stdtypes.html#str) = '/'
The endpoint path for the request. This is typically used for routing and
logging requests within the server infrastructure.
### `response_format` {#max.interfaces.TextGenerationRequest.response_format}
> response\_format: [TextGenerationResponseFormat](#max.interfaces.TextGenerationResponseFormat) | [None](https://docs.python.org/3/library/constants.html#None) = None
Specifies the desired format for the model’s output. When set, it enables
structured generation, which adheres to the json\_schema provided.
### `sampling_params` {#max.interfaces.TextGenerationRequest.sampling_params}
> sampling\_params: [SamplingParams](#max.interfaces.SamplingParams)
Token sampling configuration parameters for the request.
### `stop` {#max.interfaces.TextGenerationRequest.stop}
> stop: [str](https://docs.python.org/3/library/stdtypes.html#str) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)] | [None](https://docs.python.org/3/library/constants.html#None) = None
Optional list of stop expressions (see the [OpenAI `stop` parameter](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop)).
### `target_endpoint` {#max.interfaces.TextGenerationRequest.target_endpoint}
> target\_endpoint: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None) = None
Optional target endpoint identifier for routing the request to a specific
service or model instance. This should be used in disaggregate serving
scenarios, when you want to dynamically route to a specific instance.
If not specified, the request will be routed to the default endpoint.
### `timestamp_ns` {#max.interfaces.TextGenerationRequest.timestamp_ns}
> timestamp\_ns: [int](https://docs.python.org/3/library/functions.html#int) = 0
The time (in nanoseconds) when the request was received by the server. This
can be useful for performance monitoring and logging purposes.
### `tools` {#max.interfaces.TextGenerationRequest.tools}
> tools: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[TextGenerationRequestTool](#max.interfaces.TextGenerationRequestTool)] | [None](https://docs.python.org/3/library/constants.html#None) = None
A list of tools that can be invoked during the generation process. This
allows the model to utilize external functionalities or APIs to enhance its
responses.
## `TextGenerationRequestFunction` {#max.interfaces.TextGenerationRequestFunction}
> class max.interfaces.TextGenerationRequestFunction
Represents a function definition for a text generation request.
### `description` {#max.interfaces.TextGenerationRequestFunction.description}
> description: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None)
A human-readable description of the function’s purpose.
### `name` {#max.interfaces.TextGenerationRequestFunction.name}
> name: [str](https://docs.python.org/3/library/stdtypes.html#str)
The name of the function to be invoked.
### `parameters` {#max.interfaces.TextGenerationRequestFunction.parameters}
> parameters: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Any](https://docs.python.org/3/library/typing.html#typing.Any)]
A dictionary describing the function’s parameters, typically following a JSON schema.
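A hedged example of a function definition using the documented field names; the weather-lookup function itself is hypothetical:

```python
# Hypothetical function definition; `parameters` follows JSON schema.
get_weather = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}
```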
## `TextGenerationRequestMessage` {#max.interfaces.TextGenerationRequestMessage}
> class max.interfaces.TextGenerationRequestMessage(\*, role, content)
### `model_config` {#max.interfaces.TextGenerationRequestMessage.model_config}
> model\_config: ClassVar\[ConfigDict] = {'from\_attributes': True, 'frozen': True}
Configuration for the model, should be a dictionary conforming to \[ConfigDict]\[pydantic.config.ConfigDict].
### `number_of_images` {#max.interfaces.TextGenerationRequestMessage.number_of_images}
> property number\_of\_images: [int](https://docs.python.org/3/library/functions.html#int)
Returns the number of ImageContentPart instances in the message content.
### `role` {#max.interfaces.TextGenerationRequestMessage.role}
> role: MessageRole
### `validate_content_format()` {#max.interfaces.TextGenerationRequestMessage.validate_content_format}
> classmethod validate\_content\_format(v)
**Parameters:**
v ([Any](https://docs.python.org/3/library/typing.html#typing.Any))
## `TextGenerationRequestTool` {#max.interfaces.TextGenerationRequestTool}
> class max.interfaces.TextGenerationRequestTool
Represents a tool definition for a text generation request.
### `function` {#max.interfaces.TextGenerationRequestTool.function}
> function: [TextGenerationRequestFunction](#max.interfaces.TextGenerationRequestFunction)
The function definition associated with the tool, including its name, description, and parameters.
### `type` {#max.interfaces.TextGenerationRequestTool.type}
> type: [str](https://docs.python.org/3/library/stdtypes.html#str)
The type of the tool, typically indicating the tool’s category or usage.
## `TextGenerationResponseFormat` {#max.interfaces.TextGenerationResponseFormat}
> class max.interfaces.TextGenerationResponseFormat
Represents the response format specification for a text generation request.
### `json_schema` {#max.interfaces.TextGenerationResponseFormat.json_schema}
> json\_schema: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Any](https://docs.python.org/3/library/typing.html#typing.Any)]
A JSON schema dictionary that defines the structure and validation rules for the generated response.
### `type` {#max.interfaces.TextGenerationResponseFormat.type}
> type: [str](https://docs.python.org/3/library/stdtypes.html#str)
The type of response format, e.g., “json\_object”.
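A hypothetical `response_format` payload using the documented fields, constraining output to a one-key JSON object:

```python
# Hypothetical structured-generation request fragment; the schema content
# is illustrative only.
response_format = {
    "type": "json_object",
    "json_schema": {
        "type": "object",
        "properties": {"answer": {"type": "string"}},
        "required": ["answer"],
    },
}
```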
## `TokenBuffer` {#max.interfaces.TokenBuffer}
> class max.interfaces.TokenBuffer(array)
A dynamically resizable container for managing token sequences.
TokenBuffer provides efficient storage and access to token sequences during
text generation. It maintains the prompt tokens (initial input) and generated
tokens (model output) separately, while handling automatic memory management
as new tokens are added.
TokenBuffer organizes tokens across three related views:
1. The full stored sequence (all), split into prompt and generated tokens.
2. The processing window (active versus processed and pending tokens).
3. The streaming window over newly generated tokens consumed by callers.
The first diagram shows how prompt and generated tokens share a single
backing array. Later diagrams explain how processing and streaming walk
over that array during generation:
```default
+-------------------- all --------------------+
+-----------------+---------------------------+
| prompt | generated |
+-----------------+---------------------------+
0 prompt_length ^ generated_length ^
0 len(self) ^
```
This includes three attributes for accessing tokens:
* all: The slice of the array containing all valid tokens.
* prompt: The slice of the array containing the prompt tokens.
* generated: The slice of the array containing the generated tokens.
Along with three attributes for tracking their lengths:
* prompt\_length: The number of tokens in the prompt.
* generated\_length: The number of generated tokens.
* len(self): The total number of valid tokens in the buffer.
Processing window (what the model will process next):
```default
+-------------------------------- all -------------------------+
+-------------------+---------------------------+---------------+
| processed | active | pending |
+-------------------+---------------------------+---------------+
0 processed_length ^ active_length ^
0 current_position ^
0 len(self) ^
```
In the above, processed tracks tokens that have already been processed,
active tracks tokens scheduled for processing in the next batch, and
pending tracks tokens that have not yet been processed but are not
scheduled for the next batch (this commonly occurs during chunked
prefill).
This includes one attribute for accessing tokens:
* active: The slice of the array containing the tokens scheduled
for processing in the next batch.
Along with three additional attributes for tracking their lengths:
* processed\_length: The number of tokens that have already been processed.
* active\_length: The number of tokens currently scheduled for
processing in the next batch.
* current\_position: The global index marking the end of the current
active processing window.
This processing view is updated by methods such as rewind\_processing,
skip\_processing, chunk, and advance\_chunk/advance\_with\_token, which
control how much of the existing sequence is reprocessed or advanced at
each step.
It also maintains a streaming window over the generated tokens
for completion streaming:
```default
+------------- generated -------------+
+------------+------------------------+
| streamed | ready to stream next |
+------------+------------------------+
| (1) | (2) |
```
Generated tokens are conceptually split into:
1. **streamed**: tokens that have already been returned to the caller.
2. **ready to stream**: the newest generated tokens that have not yet
been returned.
Each call to consume\_recently\_generated\_tokens() returns the (2) region
and advances the boundary between (1) and (2), so subsequent calls only
see newly generated tokens.
Together, these three views let TokenBuffer support efficient prompt
handling, chunked processing, and incremental streaming while exposing a small,
consistent public API.
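The three views can be illustrated with a minimal stand-in class. The behavior below reflects the descriptions above under stated assumptions; it is not the real `TokenBuffer` implementation:

```python
import numpy as np


class TokenBufferSketch:
    """Illustrative stand-in for the documented prompt/processing/streaming views."""

    def __init__(self, prompt_tokens: list[int]) -> None:
        self.array = np.array(prompt_tokens, dtype=np.int64)
        self.prompt_length = len(prompt_tokens)
        self.processed_length = 0                 # start of the processing window
        self.active_length = len(prompt_tokens)   # whole prompt pending at first
        self._streamed = 0                        # streaming boundary in generated

    @property
    def generated_length(self) -> int:
        return len(self.array) - self.prompt_length

    @property
    def active(self) -> np.ndarray:
        start = self.processed_length
        return self.array[start:start + self.active_length]

    def chunk(self, chunk_size: int) -> None:
        # Limit the next processing step (chunked prefill).
        self.active_length = min(self.active_length, chunk_size)

    def advance_chunk(self) -> None:
        # The remaining unprocessed tokens become active.
        self.processed_length += self.active_length
        self.active_length = len(self.array) - self.processed_length

    def advance_with_token(self, token: int) -> None:
        # Mark the active window processed, append the new token, and make
        # it the sole active token for the next step.
        self.processed_length += self.active_length
        self.array = np.append(self.array, np.int64(token))
        self.active_length = 1

    def consume_recently_generated_tokens(self) -> np.ndarray:
        # Return generated tokens not yet streamed; advance the boundary.
        generated = self.array[self.prompt_length:]
        out = generated[self._streamed:]
        self._streamed = len(generated)
        return out
```

Walking a four-token prompt through a two-token chunk, then generating two tokens, exercises all three views: `chunk(2)` narrows the active window, `advance_chunk()` makes the rest of the prompt active, and each `advance_with_token()` leaves exactly one newly generated token ready to stream.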
### `active` {#max.interfaces.TokenBuffer.active}
> property active: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[int64]]
Return the tokens queued for the next processing step.
### `active_length` {#max.interfaces.TokenBuffer.active_length}
> property active\_length: [int](https://docs.python.org/3/library/functions.html#int)
Count of tokens currently scheduled for processing.
### `actively_chunked` {#max.interfaces.TokenBuffer.actively_chunked}
> property actively\_chunked: [bool](https://docs.python.org/3/library/functions.html#bool)
Check if the buffer has active chunk limits applied.
**Returns:**
True if chunk limits are active, False otherwise.
### `advance_chunk()` {#max.interfaces.TokenBuffer.advance_chunk}
> advance\_chunk()
Move to the next set of tokens after a limited chunk.
Call this after maybe\_chunk when you have finished working with the
current active tokens and want the remaining tokens in the sequence
to become active.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If called before maybe\_chunk has limited the active
tokens (i.e., when no chunk is currently active).
**Return type:**
None
### `advance_with_token()` {#max.interfaces.TokenBuffer.advance_with_token}
> advance\_with\_token(token, mark\_previous\_as\_processed=True)
Add a new token to the buffer.
**Parameters:**
* token ([int](https://docs.python.org/3/library/functions.html#int)) – The token ID to add.
* mark\_previous\_as\_processed ([bool](https://docs.python.org/3/library/functions.html#bool)) – If False, expands the set of active tokens instead of
shifting forward. This is useful for speculative execution
scenarios where multiple tokens may be generated.
**Return type:**
None
### `all` {#max.interfaces.TokenBuffer.all}
> property all: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[int64]]
Return every valid token currently stored (prompt + generated).
Use this when downstream components need the full sequence for scoring,
logging, or serialization.
### `apply_processing_offset()` {#max.interfaces.TokenBuffer.apply_processing_offset}
> apply\_processing\_offset(value)
Set the processing offset.
**Parameters:**
value ([int](https://docs.python.org/3/library/functions.html#int)) – The new processing offset.
**Return type:**
None
### `array` {#max.interfaces.TokenBuffer.array}
> array: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[int64]]
In-place storage holding the prompt plus any generated tokens.
### `chunk()` {#max.interfaces.TokenBuffer.chunk}
> chunk(chunk\_size)
Limit the upcoming processing step to at most chunk\_size tokens.
**Parameters:**
chunk\_size ([int](https://docs.python.org/3/library/functions.html#int)) – Maximum number of tokens to process.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If chunk\_size is not between 1 and the current number of active tokens.
**Return type:**
None
### `consume_recently_generated_tokens()` {#max.interfaces.TokenBuffer.consume_recently_generated_tokens}
> consume\_recently\_generated\_tokens()
Return newly generated tokens since the last consumption.
**Returns:**
A slice containing tokens ready to stream to the caller.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If no new tokens are available.
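The consumption pattern behind `advance_with_token` and `consume_recently_generated_tokens` amounts to a cursor over the token array: each consumption returns only tokens appended since the previous one. A pure-Python sketch (hypothetical internals, not the real `TokenBuffer`):

```python
# Illustrative sketch of streaming consumption: a cursor tracks what has
# already been handed to the caller, so each call returns only new tokens.
# NOT the real max.interfaces.TokenBuffer implementation.

class StreamingBuffer:
    def __init__(self, prompt):
        self.tokens = list(prompt)
        self.consumed = len(prompt)  # prompt tokens are never streamed back

    def advance_with_token(self, token):
        self.tokens.append(token)

    def consume_recently_generated_tokens(self):
        if self.consumed == len(self.tokens):
            raise ValueError("no new tokens are available")
        new = self.tokens[self.consumed:]
        self.consumed = len(self.tokens)
        return new

buf = StreamingBuffer([101, 102])
buf.advance_with_token(7)
buf.advance_with_token(8)
assert buf.consume_recently_generated_tokens() == [7, 8]
```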
### `current_position` {#max.interfaces.TokenBuffer.current_position}
> property current\_position: [int](https://docs.python.org/3/library/functions.html#int)
Global index marking the end of the current active processing window.
Equal to processed\_length + active\_length; represents the index of
the next token to be processed, which may be less than the total length
when processing is limited by chunking.
### `generated` {#max.interfaces.TokenBuffer.generated}
> property generated: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[int64]]
Return all tokens produced after the prompt.
Use this slice for stop checks, repetition penalties, or any logic that
should consider only newly generated content.
### `generated_length` {#max.interfaces.TokenBuffer.generated_length}
> property generated\_length: [int](https://docs.python.org/3/library/functions.html#int)
Number of tokens generated after the prompt.
### `has_outstanding_generated_tokens` {#max.interfaces.TokenBuffer.has_outstanding_generated_tokens}
> property has\_outstanding\_generated\_tokens: [bool](https://docs.python.org/3/library/functions.html#bool)
Indicates whether there are generated tokens that have not yet been consumed.
**Returns:**
True if there are outstanding generated tokens to be streamed or processed; False otherwise.
### `overwrite_last_token()` {#max.interfaces.TokenBuffer.overwrite_last_token}
> overwrite\_last\_token(token)
Overwrite the last token in the buffer.
### `processed_length` {#max.interfaces.TokenBuffer.processed_length}
> property processed\_length: [int](https://docs.python.org/3/library/functions.html#int)
Number of tokens that have already been processed.
### `prompt` {#max.interfaces.TokenBuffer.prompt}
> property prompt: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[int64]]
Return only the original prompt tokens.
Helpful for echo suppression, prompt-side metrics, or offset
calculations that should exclude generated output.
### `prompt_length` {#max.interfaces.TokenBuffer.prompt_length}
> property prompt\_length: [int](https://docs.python.org/3/library/functions.html#int)
Number of tokens that belong to the prompt.
### `reset_as_new_prompt()` {#max.interfaces.TokenBuffer.reset_as_new_prompt}
> reset\_as\_new\_prompt()
Treat the current sequence as a fresh prompt.
Marks all existing tokens as prompt tokens so the next generation pass
starts from this state.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the buffer state is invalid.
**Return type:**
None
### `rewind_processing()` {#max.interfaces.TokenBuffer.rewind_processing}
> rewind\_processing(n)
Re-expose n earlier tokens so they can be processed again.
**Parameters:**
n ([int](https://docs.python.org/3/library/functions.html#int)) – Number of tokens to move back into the active window.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If n is negative.
**Return type:**
None
### `skip_processing()` {#max.interfaces.TokenBuffer.skip_processing}
> skip\_processing(n)
Advance the active window start by n tokens.
**Parameters:**
n ([int](https://docs.python.org/3/library/functions.html#int)) – Number of tokens to drop from the active window.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If n exceeds the number of available tokens to process,
or if skipping n tokens would leave 0 active tokens.
**Return type:**
None
## `VLMTextGenerationContext` {#max.interfaces.VLMTextGenerationContext}
> class max.interfaces.VLMTextGenerationContext(\*args, \*\*kwargs)
Protocol defining the interface for VLM input contexts.
### `compute_image_aligned_idx()` {#max.interfaces.VLMTextGenerationContext.compute_image_aligned_idx}
> compute\_image\_aligned\_idx(idx)
Possibly aligns an index value downward if it lies in the middle of an image.
### `image_idx` {#max.interfaces.VLMTextGenerationContext.image_idx}
> property image\_idx: [int](https://docs.python.org/3/library/functions.html#int)
Index of the next unencoded image in the prompt.
### `images` {#max.interfaces.VLMTextGenerationContext.images}
> property images: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[ImageMetadata](#max.interfaces.ImageMetadata)]
Returns the images in the context.
### `needs_vision_encoding` {#max.interfaces.VLMTextGenerationContext.needs_vision_encoding}
> property needs\_vision\_encoding: [bool](https://docs.python.org/3/library/functions.html#bool)
Returns whether vision encoding is needed for this context.
### `next_images` {#max.interfaces.VLMTextGenerationContext.next_images}
> property next\_images: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[ImageMetadata](#max.interfaces.ImageMetadata)]
Returns the images that are not yet encoded.
## `drain_queue()` {#max.interfaces.drain_queue}
> max.interfaces.drain\_queue(pull\_queue, max\_items=None)
Remove and return items from the queue without blocking.
This method is expected to return an empty list if the queue is empty.
If max\_items is specified, at most that many items will be returned.
**Parameters:**
* pull\_queue ([MAXPullQueue](#max.interfaces.MAXPullQueue)\[PullItemType]) – The queue to drain items from.
* max\_items ([int](https://docs.python.org/3/library/functions.html#int) | None) – Maximum number of items to return. If None, returns all items.
**Returns:**
List of items removed from the queue, limited by max\_items if specified.
## `get_blocking()` {#max.interfaces.get_blocking}
> max.interfaces.get\_blocking(pull\_queue)
Get the next item from the queue.
If no item is available, this method will spin until one is.
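These two helpers follow standard queue idioms. A stdlib-only sketch of the same behavior over `queue.Queue` (the real functions operate on a `MAXPullQueue`, so this is an approximation of their contracts, not the actual implementation):

```python
import queue

def drain_queue(pull_queue, max_items=None):
    """Non-blocking drain: return up to max_items items, or [] if empty."""
    items = []
    while max_items is None or len(items) < max_items:
        try:
            items.append(pull_queue.get_nowait())
        except queue.Empty:
            break
    return items

def get_blocking(pull_queue):
    """Spin until an item is available, then return it."""
    while True:
        try:
            return pull_queue.get_nowait()
        except queue.Empty:
            continue  # busy-wait, mirroring the documented spin behavior

q = queue.Queue()
for i in range(5):
    q.put(i)
assert drain_queue(q, max_items=3) == [0, 1, 2]
assert get_blocking(q) == 3
assert drain_queue(q) == [4]
assert drain_queue(q) == []  # empty queue yields an empty list
```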
## `msgpack_numpy_decoder()` {#max.interfaces.msgpack_numpy_decoder}
> max.interfaces.msgpack\_numpy\_decoder(type\_, copy=True)
Create a decoder function for the specified type.
**Parameters:**
* type\_ ([Any](https://docs.python.org/3/library/typing.html#typing.Any)) – The type to decode into.
* copy ([bool](https://docs.python.org/3/library/functions.html#bool)) – Copy numpy arrays if True. Defaults to True.
Copy is set to True by default because most downstream usage of deserialized tensors is as MAX driver tensors, which require owned numpy arrays.
This is a constraint imposed by dlpack and numpy: we cannot create a buffer from read-only data.
While skipping copies by default would speed up deserialization, doing so often just moves the work downstream to an implicit copy during Buffer.from\_numpy.
As a result, it is easier to make the copy explicit here and maintain the pattern that all numpy arrays used in MAX are owned by the current process.
**Returns:**
A pickleable decoder instance that decodes bytes into the specified type
**Return type:**
MsgpackNumpyDecoder
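The copy-by-default rationale comes down to buffer ownership: a zero-copy view over received bytes is read-only, and read-only memory cannot back a writable buffer. A stdlib-only illustration of that constraint (numpy's `frombuffer` and dlpack behave analogously):

```python
# A bytes object is immutable, so any zero-copy view of it is read-only;
# constructing a writable buffer requires taking an owned copy first.
payload = bytes(range(8))   # stands in for a received serialized payload

view = memoryview(payload)  # zero-copy view over the immutable bytes
assert view.readonly        # cannot hand this to a writable consumer

owned = bytearray(payload)  # explicit copy -> owned, writable memory
writable = memoryview(owned)
assert not writable.readonly
writable[0] = 255           # safe: this process owns the buffer
assert owned[0] == 255
```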
## `msgpack_numpy_encoder()` {#max.interfaces.msgpack_numpy_encoder}
> max.interfaces.msgpack\_numpy\_encoder(use\_shared\_memory=False, shared\_memory\_threshold=0)
Create an encoder function that handles numpy arrays.
**Parameters:**
* use\_shared\_memory ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to attempt shared memory conversion for numpy arrays
* shared\_memory\_threshold ([int](https://docs.python.org/3/library/functions.html#int)) – Minimum size in bytes for shared memory conversion.
If 0, all arrays are candidates for conversion.
**Returns:**
A pickleable encoder instance that encodes objects into bytes
**Return type:**
MsgpackNumpyEncoder
---
## kv_cache
KV cache management for efficient attention computation during inference.
This package provides implementations for managing key-value caches used in
transformer models. The paged attention implementation enables efficient memory
management by fragmenting cache memory into pages, allowing for better memory
utilization and support for prefix caching.
## Functions
* [`load_kv_manager`](/max/api/python/kv_cache/registry): Load and initialize a KV cache manager.
* [`estimate_kv_cache_size`](/max/api/python/kv_cache/registry): Estimate KV cache memory requirements.
* [`infer_optimal_batch_size`](/max/api/python/kv_cache/registry): Infer optimal batch size based on available cache memory.
* [`available_port`](/max/api/python/kv_cache/paged_kv_cache/transfer_engine): Find an available TCP port for transfer engine communication.
## Modules
* [`registry`](/max/api/python/kv_cache/registry): KV cache manager factory functions and utilities.
## Packages
* [`paged_kv_cache`](/max/api/python/kv_cache/paged_kv_cache): Paged attention KV cache implementation.
## Classes
* [`PagedKVCacheManager`](/max/api/python/kv_cache/paged_kv_cache/cache_manager): Manager for paged KV cache with data and tensor parallelism support.
* [`KVTransferEngine`](/max/api/python/kv_cache/paged_kv_cache/transfer_engine): Manages KV cache transfers between devices in distributed settings.
* [`KVTransferEngineMetadata`](/max/api/python/kv_cache/paged_kv_cache/transfer_engine): Metadata for KV cache transfer engine configuration.
* [`TransferReqData`](/max/api/python/kv_cache/paged_kv_cache/transfer_engine): Data structure for KV cache transfer requests.
---
## cache_manager
## `PagedKVCacheManager` {#max.kv_cache.paged_kv_cache.cache_manager.PagedKVCacheManager}
> class max.kv\_cache.paged\_kv\_cache.cache\_manager.PagedKVCacheManager(params, session, total\_num\_pages, total\_num\_host\_pages=0, enable\_runtime\_checks=False)
Paged KVCache manager with data and tensor parallelism support.
```python
kv_manager.claim(ctx1.request_id, replica_idx=0)
kv_manager.claim(ctx2.request_id, replica_idx=1)

# Allocate blocks for these requests
kv_manager.alloc(ctx1, replica_idx=0, num_steps=10)
kv_manager.alloc(ctx2, replica_idx=1, num_steps=10)

# Get KV cache inputs to feed to the graph.
# The outer list is indexed by replica: ctx1 is on replica 0, ctx2 on replica 1.
kv_cache_inputs = kv_manager.get_runtime_inputs(
    [[ctx1], [ctx2]], num_steps=10
)

# Run model...

# Update requests with newly generated tokens
ctx1.update(42)
ctx2.update(42)

# Commit newly written blocks to the prefix cache
kv_manager.step([[ctx1], [ctx2]])

# Release metadata and KV blocks for these requests
kv_manager.release(ctx1.request_id, replica_idx=0)
kv_manager.release(ctx2.request_id, replica_idx=1)
```
### `alloc()` {#max.kv_cache.paged_kv_cache.cache_manager.PagedKVCacheManager.alloc}
> alloc(data, replica\_idx, num\_steps=1)
Allocates blocks for a request to run for N steps.
This method allocates blocks needed by a request to run for N steps.
When prefix caching is enabled, some of the allocated blocks may be
retrieved from the prefix cache.
**Parameters:**
* data ([TextGenerationContext](../../interfaces.md#max.interfaces.TextGenerationContext)) – The text generation context for the request. The request ID
must already be assigned to a replica via claim.
* num\_steps ([int](https://docs.python.org/3/library/functions.html#int)) – The number of steps to reserve blocks for. Default: 1.
* replica\_idx ([int](https://docs.python.org/3/library/functions.html#int))
**Raises:**
InsufficientBlocksError – If there are insufficient free blocks to satisfy the allocation.
**Return type:**
None
### `claim()` {#max.kv_cache.paged_kv_cache.cache_manager.PagedKVCacheManager.claim}
> claim(request\_id, replica\_idx)
Reserve a sequence ID for the given request ID.
### `get_pct_used_blocks_after_allocation()` {#max.kv_cache.paged_kv_cache.cache_manager.PagedKVCacheManager.get_pct_used_blocks_after_allocation}
> get\_pct\_used\_blocks\_after\_allocation(ctx, replica\_idx, num\_steps=1)
Get the percentage of blocks used after allocating for a request.
**Parameters:**
* ctx ([TextGenerationContext](../../interfaces.md#max.interfaces.TextGenerationContext)) – The request context containing sequence information and token indices.
* num\_steps ([int](https://docs.python.org/3/library/functions.html#int)) – Number of additional steps to allocate blocks for. Defaults to 1.
* replica\_idx ([int](https://docs.python.org/3/library/functions.html#int))
**Returns:**
The percentage of total blocks used after allocating for the request.
### `get_runtime_inputs()` {#max.kv_cache.paged_kv_cache.cache_manager.PagedKVCacheManager.get_runtime_inputs}
> get\_runtime\_inputs(batches, num\_steps=1)
Get the graph inputs for per-replica batches of requests.
This method will raise a RuntimeError if any request has insufficient blocks
already allocated to it to run for the given number of steps.
**Parameters:**
* batches ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[TextGenerationContext](../../interfaces.md#max.interfaces.TextGenerationContext)]]) – Per-replica batches of requests
* num\_steps ([int](https://docs.python.org/3/library/functions.html#int)) – Number of steps to run for
### `step()` {#max.kv_cache.paged_kv_cache.cache_manager.PagedKVCacheManager.step}
> step(batches)
Commit new tokens into the prefix cache for per-replica batches.
---
## paged_kv_cache
Paged attention KV cache implementation with support for distributed inference.
This package provides the core implementation of paged KV cache management,
including cache managers, transfer engines for distributed settings, and tensor
parallelism support.
## Modules
* [`cache_manager`](/max/api/python/kv_cache/paged_kv_cache/cache_manager): Core paged KV cache manager implementation.
* [`tp_cache_manager`](/max/api/python/kv_cache/paged_kv_cache/tp_cache_manager): Tensor parallelism cache manager and input symbols.
* [`transfer_engine`](/max/api/python/kv_cache/paged_kv_cache/transfer_engine): KV cache transfer engine for distributed inference.
## Classes
* [`PagedKVCacheManager`](/max/api/python/kv_cache/paged_kv_cache/cache_manager): Manager for paged KV cache with data and tensor parallelism support.
* [`KVTransferEngine`](/max/api/python/kv_cache/paged_kv_cache/transfer_engine): Manages KV cache transfers between devices.
* [`KVTransferEngineMetadata`](/max/api/python/kv_cache/paged_kv_cache/transfer_engine): Metadata for transfer engine configuration.
* [`TransferReqData`](/max/api/python/kv_cache/paged_kv_cache/transfer_engine): Transfer request data structure.
---
## tp_cache_manager
PagedAttention-enabled KV cache for the Transformer leveraging the mo.opaque pattern.
---
## transfer_engine
KVCache Transfer Engine
## `KVTransferEngine` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine}
> class max.kv\_cache.paged\_kv\_cache.transfer\_engine.KVTransferEngine(name, tensors, \*, total\_num\_pages)
KVCache Transfer Engine with support for Data Parallelism (DP) and Tensor Parallelism (TP).
The engine accepts a 2D list of tensors: list\[list\[Buffer]] where the outer list
represents DP replicas and the inner list represents TP shards within each replica.
The TransferEngine communicates with other TransferEngines in other threads
or processes. However, individual TransferEngines themselves are not
thread-safe. It is intended to be used by MAX’s single-threaded scheduler.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str))
* tensors (Sequence\[Sequence\[[Buffer](../../driver.md#max.driver.Buffer)]])
* total\_num\_pages ([int](https://docs.python.org/3/library/functions.html#int))
### `bytes_per_page` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.bytes_per_page}
> bytes\_per\_page: [int](https://docs.python.org/3/library/functions.html#int)
Bytes per page for each tensor.
### `cleanup()` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.cleanup}
> cleanup()
Release all resources associated with the transfer engine.
Should be called before the transfer engine is garbage collected.
Moving this logic into the \_\_del\_\_ destructor causes a UCX error for unknown reasons.
**Return type:**
None
### `cleanup_transfer()` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.cleanup_transfer}
> cleanup\_transfer(transfer\_req)
Cleanup a transfer. This should be called after a transfer is complete.
**Parameters:**
transfer\_req ([TransferReqData](#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData)) – The transfer request to cleanup.
**Return type:**
None
### `completed_recv_transfers` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.completed_recv_transfers}
> completed\_recv\_transfers: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [int](https://docs.python.org/3/library/functions.html#int)]]
Map of agent names to completed recv transfers.
### `connect()` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.connect}
> connect(remote)
Connect to a remote engine (all replicas).
**Parameters:**
remote ([KVTransferEngineMetadata](#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata)) – Metadata for the remote engine (all replicas).
**Return type:**
None
### `dp` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.dp}
> dp: [int](https://docs.python.org/3/library/functions.html#int)
Number of DP replicas.
### `inflight_send_transfers` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.inflight_send_transfers}
> inflight\_send\_transfers: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [TransferReqData](#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData)]
Map of transfer names to send transfer request data.
### `initiate_send_transfer()` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.initiate_send_transfer}
> initiate\_send\_transfer(remote\_metadata, src\_idxs, dst\_idxs, src\_replica\_idx, dst\_replica\_idx)
Initiate a transfer from current engine to remote engine.
The same page indices are broadcast to all TP shards within the source and destination replicas.
**Parameters:**
* remote\_metadata ([KVTransferEngineMetadata](#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata)) – Metadata for the remote engine.
* src\_idxs ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]) – List of indices of the source pages in the current engine.
* dst\_idxs ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]) – List of indices of the destination pages in the remote engine.
* src\_replica\_idx ([int](https://docs.python.org/3/library/functions.html#int)) – Index of the source replica to transfer from.
* dst\_replica\_idx ([int](https://docs.python.org/3/library/functions.html#int)) – Index of the destination replica to transfer to.
### `is_complete()` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.is_complete}
> is\_complete(transfer\_req)
Checks if a given send or recv transfer is completed.
:::caution Caution
This method is prone to infinite loops. For the transfer to progress,
the remote engine MUST call wait\_recv\_complete. As such, the following
code will hang:
```python
transfer_req = engine_1.write_to(...)
while not engine_1.is_complete(transfer_req):
    pass
while not engine_2.is_complete(transfer_req):
    pass
```
Instead do:
```python
transfer_req = engine_1.write_to(...)
while not engine_1.is_complete(transfer_req) or not engine_2.is_complete(transfer_req):
    pass
```
:::
**Parameters:**
transfer\_req ([TransferReqData](#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData)) – The transfer request.
**Returns:**
True if all transfers have completed; False otherwise.
### `memory_type` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.memory_type}
> memory\_type: MemoryType
Type of memory being managed (e.g. DRAM).
### `metadata` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.metadata}
> property metadata: [KVTransferEngineMetadata](#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata)
Get metadata for all replicas.
**Returns:**
Metadata for the entire engine (all replicas).
### `name` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.name}
> name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Name of transfer engine / nixl agent.
### `remote_agent_to_engine` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.remote_agent_to_engine}
> remote\_agent\_to\_engine: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [str](https://docs.python.org/3/library/stdtypes.html#str)]
Map of remote agent names to their engine names.
### `remote_connections` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.remote_connections}
> remote\_connections: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [KVTransferEngineMetadata](#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata)]
Map of remote engine names to their metadata.
### `sync_and_release()` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.sync_and_release}
> sync\_and\_release(transfer\_req)
Wait for a transfer to complete and release the transfer after it completes.
### `total_num_pages` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.total_num_pages}
> total\_num\_pages: [int](https://docs.python.org/3/library/functions.html#int)
Total number of pages in each tensor (same across all replicas).
### `tp` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngine.tp}
> tp: [int](https://docs.python.org/3/library/functions.html#int)
Number of TP shards per replica.
## `KVTransferEngineMetadata` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata}
> class max.kv\_cache.paged\_kv\_cache.transfer\_engine.KVTransferEngineMetadata(\*, name, total\_num\_pages, bytes\_per\_page, memory\_type, hostname, agents\_meta)
Metadata associated with a transfer engine.
This is safe to send between threads/processes.
### `bytes_per_page` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata.bytes_per_page}
> bytes\_per\_page: [int](https://docs.python.org/3/library/functions.html#int)
Bytes per page for each tensor.
### `hostname` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata.hostname}
> hostname: [str](https://docs.python.org/3/library/stdtypes.html#str)
Hostname of the machine that the transfer engine is running on.
### `memory_type` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata.memory_type}
> memory\_type: MemoryType
Memory type of the transfer engine.
### `name` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata.name}
> name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Base name of the transfer engine.
### `total_num_pages` {#max.kv_cache.paged_kv_cache.transfer_engine.KVTransferEngineMetadata.total_num_pages}
> total\_num\_pages: [int](https://docs.python.org/3/library/functions.html#int)
Total number of pages in each tensor.
## `TensorAgent` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgent}
> class max.kv\_cache.paged\_kv\_cache.transfer\_engine.TensorAgent(agent, agent\_name, tensor, base\_addr, ucx\_backend, device\_id, agent\_metadata, reg\_dlist)
Manages a single tensor and its associated NIXL agent for transfers.
This class holds both the runtime state (live objects) and can generate
the serializable metadata for communication between engines.
### `device_id` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgent.device_id}
> device\_id: [int](https://docs.python.org/3/library/functions.html#int)
Device ID for this tensor.
### `reg_dlist` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgent.reg_dlist}
> reg\_dlist: RegistrationDescriptorList
Registration descriptor list for this tensor.
### `tensor` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgent.tensor}
> tensor: [Buffer](../../driver.md#max.driver.Buffer)
Tensor for this agent.
### `to_metadata()` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgent.to_metadata}
> to\_metadata()
Convert to serializable metadata for communication.
### `ucx_backend` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgent.ucx_backend}
> ucx\_backend: [int](https://docs.python.org/3/library/functions.html#int)
UCX backend for this tensor.
## `TensorAgentMetadata` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgentMetadata}
> class max.kv\_cache.paged\_kv\_cache.transfer\_engine.TensorAgentMetadata(\*, agent\_name, metadata, base\_addr, device\_id)
Metadata for a single tensor/agent in the transfer engine.
This is used for serialization and communication between engines.
### `agent_name` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgentMetadata.agent_name}
> agent\_name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Name of this agent.
### `base_addr` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgentMetadata.base_addr}
> base\_addr: [int](https://docs.python.org/3/library/functions.html#int)
Base memory address for this tensor.
### `device_id` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgentMetadata.device_id}
> device\_id: [int](https://docs.python.org/3/library/functions.html#int)
Device ID for this tensor.
### `metadata` {#max.kv_cache.paged_kv_cache.transfer_engine.TensorAgentMetadata.metadata}
> metadata: [bytes](https://docs.python.org/3/library/stdtypes.html#bytes)
Metadata for this agent.
## `TransferReqData` {#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData}
> class max.kv\_cache.paged\_kv\_cache.transfer\_engine.TransferReqData(\*, dst\_name, src\_name, transfer\_name, transfer\_ids, src\_idxs, dst\_idxs, src\_replica\_idx, dst\_replica\_idx)
Metadata associated with a transfer request.
This is safe to send between threads/processes.
### `dst_idxs` {#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData.dst_idxs}
> dst\_idxs: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
Length of destination indices can differ from len(transfer\_ids).
### `dst_name` {#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData.dst_name}
> dst\_name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Base name of destination engine.
### `dst_replica_idx` {#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData.dst_replica_idx}
> dst\_replica\_idx: [int](https://docs.python.org/3/library/functions.html#int)
Index of the destination replica this transfer is to.
### `src_idxs` {#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData.src_idxs}
> src\_idxs: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
Length of source indices can differ from len(transfer\_ids).
### `src_name` {#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData.src_name}
> src\_name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Base name of source engine.
### `src_replica_idx` {#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData.src_replica_idx}
> src\_replica\_idx: [int](https://docs.python.org/3/library/functions.html#int)
Index of the source replica this transfer is from.
### `transfer_ids` {#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData.transfer_ids}
> transfer\_ids: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
Transfer IDs (one per TP shard in the replica).
### `transfer_name` {#max.kv_cache.paged_kv_cache.transfer_engine.TransferReqData.transfer_name}
> transfer\_name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Transfer name.
## `available_port()` {#max.kv_cache.paged_kv_cache.transfer_engine.available_port}
> max.kv\_cache.paged\_kv\_cache.transfer\_engine.available\_port(start\_port=8000, end\_port=9000, max\_attempts=100)
Find an available TCP port in the given range.
**Parameters:**
* start\_port ([int](https://docs.python.org/3/library/functions.html#int)) – The lower bound of the port range (inclusive).
* end\_port ([int](https://docs.python.org/3/library/functions.html#int)) – The upper bound of the port range (inclusive).
* max\_attempts ([int](https://docs.python.org/3/library/functions.html#int)) – Maximum number of attempts to find a free port.
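The conventional way to probe for a free port is to attempt a bind and catch the failure. A minimal sketch under that assumption (the actual `available_port` implementation may differ in its probing strategy):

```python
import random
import socket

def available_port(start_port=8000, end_port=9000, max_attempts=100):
    """Probe random ports in [start_port, end_port] until one binds."""
    for _ in range(max_attempts):
        port = random.randint(start_port, end_port)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind(("127.0.0.1", port))  # bind succeeds only if free
            except OSError:
                continue  # port in use; try another
            return port
    raise RuntimeError("no available port found in range")

port = available_port()
assert 8000 <= port <= 9000
```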
## `load_kv_manager()` {#max.kv_cache.registry.load_kv_manager}
> max.kv\_cache.registry.load\_kv\_manager(params, max\_batch\_size, max\_seq\_len, session, available\_cache\_memory)
Loads a single KV cache manager from the given params.
---
## Embedding (Nn)
## `Embedding` {#max.nn.Embedding}
> class max.nn.Embedding(vocab\_size, \*, dim=None, dims=None)
A vector embedding.
An embedding can be thought of as a lookup table for vectors by index.
Given an input tensor of indices into the embedding, the result
of the embedding lookup is a tensor of the same shape, but with each index
replaced by the value of the vector in that location in the embedding table.
The common case for embeddings is a 1-dimensional embedding:
```python
from max.dtype import DType
from max.tensor import Tensor
from max.nn import Embedding
embedding = Embedding(vocab_size=1000, dim=128)
tokens = Tensor.ones([10], dtype=DType.uint64)
embedded = embedding(tokens)
assert embedded.shape == [10, 128]
```
However they just as easily support multi-dimensional embeddings:
```python
from max.dtype import DType
from max.tensor import Tensor
from max.nn import Embedding
embedding = Embedding(vocab_size=1000, dims=[16, 128])
tokens = Tensor.ones([10], dtype=DType.uint64)
embedded = embedding(tokens)
assert embedded.shape == [10, 16, 128]
```
### `dim` {#max.nn.Embedding.dim}
> property dim: [Dim](../graph/dim.md#max.graph.dim.Dim)
The dimension of the vectors in the embedding (for a 1d embedding).
Raises: An error if the embedding is not 1-dimensional (that is, for 0- or >1-dimensional embeddings).
### `dims` {#max.nn.Embedding.dims}
> property dims: [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Dim](../graph/dim.md#max.graph.dim.Dim)]
The dimensions of the vectors in the embedding.
### `forward()` {#max.nn.Embedding.forward}
> forward(indices)
Applies the vector embedding to the input tensor of indices.
**Parameters:**
indices ([Tensor](../tensor.md#max.tensor.Tensor)) – An integer-valued tensor. Values must be in the range
\[0, vocab\_size) for the embedding.
**Returns:**
A dense tensor made by looking up each index in the vector embedding.
For an input of shape `(*batch, indices)` and an embedding of shape
`(vocab_size, *dims)`, the result will have shape `(*batch, indices, *dims)`.
**Return type:**
[Tensor](../tensor.md#max.tensor.Tensor)
### `vocab_size` {#max.nn.Embedding.vocab_size}
> property vocab\_size: [Dim](../graph/dim.md#max.graph.dim.Dim)
The vocab size of the embedding.
Indices outside the range of \[0, vocab\_size) are illegal.
### `weight` {#max.nn.Embedding.weight}
> weight: [Tensor](../tensor.md#max.tensor.Tensor)
:::note Note
For the legacy graph-based embedding layer, see [legacy/embedding](/max/api/python/nn/legacy/embedding).
:::
---
## Linear
## `Linear` {#max.nn.Linear}
> class max.nn.Linear(in\_dim, out\_dim, \*, bias=True)
A unary linear transformation over an input tensor.
Linear is defined as f(x) = x @ W\.T + B where W is the
weight tensor and B is an optional bias tensor.
If W is not square then the transformation represents a
dimensionality change. By convention the weight tensor is stored
transposed.
```python
from max.nn import Linear
from max.tensor import Tensor
model = Linear(5, 10)
assert dict(model.parameters) == {
"weight": model.weight, "bias": model.bias
}
result = model(Tensor.ones([5]))
assert result.shape == [10]
```
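The stored-transposed convention means the weight has shape `(out_dim, in_dim)`, so each output element is a dot product of the input with one row of the weight. A plain-Python sketch of `f(x) = x @ W.T + B` for a 1-D input (an illustration with no MAX dependency):

```python
def linear(x, weight, bias):
    """Compute f(x) = x @ W.T + B for a 1-D input x.

    `weight` is stored transposed with shape (out_dim, in_dim), so each
    output element is the dot product of x with one weight row plus bias.
    """
    return [
        sum(xi * wi for xi, wi in zip(x, row)) + b
        for row, b in zip(weight, bias)
    ]

# Mirrors Linear(5, 10): in_dim=5, out_dim=10, so the output has 10 elements.
weight = [[0.0] * 5 for _ in range(10)]
bias = [1.0] * 10
assert len(linear([1.0] * 5, weight, bias)) == 10
```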
### `bias` {#max.nn.Linear.bias}
> bias: [Tensor](../tensor.md#max.tensor.Tensor) | [Literal](https://docs.python.org/3/library/typing.html#typing.Literal)\[0]
The bias `Tensor` for the linear transformation (or 0 if bias is disabled).
### `forward()` {#max.nn.Linear.forward}
> forward(x)
Applies a linear transformation to the input tensor.
Linear is defined as f(x) = x @ W\.T + B where W is the
weight tensor and B is an optional bias tensor.
**Parameters:**
x ([Tensor](../tensor.md#max.tensor.Tensor)) – The input tensor
**Returns:**
The result of applying the linear transformation to the tensor.
**Return type:**
[Tensor](../tensor.md#max.tensor.Tensor)
### `in_dim` {#max.nn.Linear.in_dim}
> property in\_dim: [Dim](../graph/dim.md#max.graph.dim.Dim)
The input dimension for the transformation.
### `out_dim` {#max.nn.Linear.out_dim}
> property out\_dim: [Dim](../graph/dim.md#max.graph.dim.Dim)
The output dimension for the transformation.
### `weight` {#max.nn.Linear.weight}
> weight: [Tensor](../tensor.md#max.tensor.Tensor)
The weight `Tensor` for the linear transformation.
:::note Note
For the legacy graph-based linear layer, see [legacy/linear](/max/api/python/nn/legacy/linear).
:::
---
## nn
APIs to build neural network components for deep learning models with Python.
The MAX neural network API provides two namespaces:
* **max.nn**: Eager-style execution.
* **max.nn.legacy**: Legacy graph-based API (for backward compatibility).
For functional operations like relu, softmax, and more, see the
[`functional`](/max/api/python/functional) module.
## Core API
Use these modules for all models. They provide eager-style execution with
PyTorch-style syntax.
* [`Embedding`](/max/api/python/nn/Embedding): Vector embedding layer for token representation.
* [`Linear`](/max/api/python/nn/Linear): Linear transformation layer with weights and bias.
* [`module`](/max/api/python/nn/module): Base class for all neural network modules.
* [`norm`](/max/api/python/nn/norm): Normalization layers for training stability.
* [`rope`](/max/api/python/nn/rope): Rotary position embeddings for sequence models.
* [`sequential`](/max/api/python/nn/sequential): Containers for composing modules sequentially.
## Legacy API
:::note Note
The legacy API remains available for backward compatibility. For all new models,
use the max.nn API.
:::
The legacy API provides graph-based layer implementations. See the full
reference:
* [`legacy`](/max/api/python/nn/legacy): Neural network legacy API documentation.
---
## attention_with_rope
An attention mechanism with RoPE, optimized for the opaque KV cache.
## `AttentionWithRope` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRope}
> class max.nn.legacy.attention.attention\_with\_rope.AttentionWithRope(\*, rope, sharding\_strategy=None, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, devices=None, dtype=float32, linear\_cls=Linear, stacked\_qkv=False, scale=None, has\_bias=False, float8\_config=None, clip\_qkv=None, use\_qk\_norm=False, rms\_norm\_eps=1e-06)
Implementation of attention that uses Rotary Position Embedding (RoPE).
### `qkv_input_scale` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRope.qkv_input_scale}
> property qkv\_input\_scale: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)
The max of q, k, and v scale input vectors.
### `qkv_weight_scale` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRope.qkv_weight_scale}
> property qkv\_weight\_scale: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)
The max of q, k, and v scale weight vectors.
### `qkv_weight_scale_2` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRope.qkv_weight_scale_2}
> property qkv\_weight\_scale\_2: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)
The max of the secondary q, k, and v scale weight vectors.
### `rope` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRope.rope}
> rope: [RotaryEmbedding](../rotary_embedding.md#max.nn.legacy.rotary_embedding.RotaryEmbedding)
### `shard()` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRope.shard}
> shard(devices)
Create sharded views across devices (tensor-parallel).
Returns one AttentionWithRope per device with appropriately sliced weights.
### `sharding_strategy` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRope.sharding_strategy}
> property sharding\_strategy: ShardingStrategy | [None](https://docs.python.org/3/library/constants.html#None)
Get the Module sharding strategy.
### `wqkv` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRope.wqkv}
> property wqkv: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)
The concatenation of q, k, and v weight vectors.
### `wqkv_bias` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRope.wqkv_bias}
> property wqkv\_bias: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)
The concatenation of q, k, and v bias weight vectors.
## `AttentionWithRopeNoOpaque` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRopeNoOpaque}
> class max.nn.legacy.attention.attention\_with\_rope.AttentionWithRopeNoOpaque(\*, rope, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, devices=None, dtype=float32, linear\_cls=Linear, scale=None)
Attention with RoPE without opaque KV cache.
Assumes:
* no float8 (no float8\_config)
* no stacked qkv
* no bias
* no clip\_qkv
### `rope` {#max.nn.legacy.attention.attention_with_rope.AttentionWithRopeNoOpaque.rope}
> rope: [RotaryEmbedding](../rotary_embedding.md#max.nn.legacy.rotary_embedding.RotaryEmbedding)
## `DataParallelAttentionWithRope` {#max.nn.legacy.attention.attention_with_rope.DataParallelAttentionWithRope}
> class max.nn.legacy.attention.attention\_with\_rope.DataParallelAttentionWithRope(\*, rope, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, devices=None, dtype=float32, linear\_cls=Linear, stacked\_qkv=False, scale=None, has\_bias=False, float8\_config=None, clip\_qkv=None, use\_qk\_norm=False, rms\_norm\_eps=1e-06)
Data-parallel implementation of Attention with RoPE.
This replicates the attention module across devices and runs each replica on
its local inputs (x, kv, freqs\_cis, input\_row\_offsets). No collective ops
are required; KV-cache remains local to each device.
**Notes:**
* Assumes the caller has already distributed xs, kv\_collections,
freqs\_cis, and input\_row\_offsets so that index i corresponds to
device i, with input\_row\_offsets\[i] rebased to start at 0.
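The rebasing requirement in the note above can be illustrated in plain Python. This is a hypothetical helper (not part of the API) that splits a global ragged batch's row offsets across devices and shifts each slice to start at 0:

```python
def split_row_offsets(row_offsets, num_devices):
    """Split ragged-batch row offsets across devices, rebasing each to 0.

    `row_offsets` has one entry per sequence plus a final total; e.g.
    [0, 3, 7, 12, 14] describes four sequences of lengths 3, 4, 5, and 2.
    Assumes the sequence count divides evenly across devices.
    """
    num_seqs = len(row_offsets) - 1
    per_device = num_seqs // num_devices
    shards = []
    for d in range(num_devices):
        lo, hi = d * per_device, (d + 1) * per_device
        base = row_offsets[lo]  # subtracting this rebases the shard to 0
        shards.append([off - base for off in row_offsets[lo:hi + 1]])
    return shards

# Two devices, two sequences each: every shard's offsets start at 0.
assert split_row_offsets([0, 3, 7, 12, 14], 2) == [[0, 3, 7], [0, 5, 7]]
```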
## `GGUFQAttentionWithRope` {#max.nn.legacy.attention.attention_with_rope.GGUFQAttentionWithRope}
### `rope` {#max.nn.legacy.attention.attention_with_rope.GGUFQAttentionWithRope.rope}
> rope: [RotaryEmbedding](../rotary_embedding.md#max.nn.legacy.rotary_embedding.RotaryEmbedding)
### `wqkv` {#max.nn.legacy.attention.attention_with_rope.GGUFQAttentionWithRope.wqkv}
> property wqkv: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)
The concatenation of q, k, and v weight vectors.
### `wqkv_bias` {#max.nn.legacy.attention.attention_with_rope.GGUFQAttentionWithRope.wqkv_bias}
> property wqkv\_bias: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)
The concatenation of q, k, and v bias weight vectors.
## `GPTQAttentionWithRope` {#max.nn.legacy.attention.attention_with_rope.GPTQAttentionWithRope}
> class max.nn.legacy.attention.attention\_with\_rope.GPTQAttentionWithRope(quantization\_config, rope, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, devices=None, dtype=float32, scale=None, linear\_cls=Linear)
Implementation of the GPTQ attention layer.
### `wqkv` {#max.nn.legacy.attention.attention_with_rope.GPTQAttentionWithRope.wqkv}
> property wqkv: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)
The concatenation of q, k, and v weight vectors (packed + scales).
## `TensorParallelAttentionWithRope` {#max.nn.legacy.attention.attention_with_rope.TensorParallelAttentionWithRope}
> class max.nn.legacy.attention.attention\_with\_rope.TensorParallelAttentionWithRope(\*, rope, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, devices=None, dtype=float32, linear\_cls=Linear, stacked\_qkv=False, scale=None, has\_bias=False, float8\_config=None, clip\_qkv=None, use\_qk\_norm=False, rms\_norm\_eps=1e-06)
Tensor-parallel wrapper that delegates sharding to the base module.
### `attention_mask_variant` {#max.nn.legacy.attention.mask_config.MHAMaskConfig.attention_mask_variant}
> attention\_mask\_variant: [AttentionMaskVariant](#max.nn.legacy.attention.mask_config.AttentionMaskVariant)
### `positional_encoding_variant` {#max.nn.legacy.attention.mask_config.MHAMaskConfig.positional_encoding_variant}
> positional\_encoding\_variant: [PositionalEncodingVariant](#max.nn.legacy.attention.mask_config.PositionalEncodingVariant)
## `MHAMaskVariant` {#max.nn.legacy.attention.mask_config.MHAMaskVariant}
> class max.nn.legacy.attention.mask\_config.MHAMaskVariant(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
### `CAUSAL_ALIBI_MASK` {#max.nn.legacy.attention.mask_config.MHAMaskVariant.CAUSAL_ALIBI_MASK}
> CAUSAL\_ALIBI\_MASK = '1'
### `CAUSAL_MASK` {#max.nn.legacy.attention.mask_config.MHAMaskVariant.CAUSAL_MASK}
> CAUSAL\_MASK = '0'
### `CHUNKED_CAUSAL_MASK` {#max.nn.legacy.attention.mask_config.MHAMaskVariant.CHUNKED_CAUSAL_MASK}
> CHUNKED\_CAUSAL\_MASK = '3'
### `NULL_MASK` {#max.nn.legacy.attention.mask_config.MHAMaskVariant.NULL_MASK}
> NULL\_MASK = '2'
### `SLIDING_WINDOW_CAUSAL_MASK` {#max.nn.legacy.attention.mask_config.MHAMaskVariant.SLIDING_WINDOW_CAUSAL_MASK}
> SLIDING\_WINDOW\_CAUSAL\_MASK = '4'
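Two of these variants correspond to standard boolean mask shapes. A plain-Python sketch of the causal and sliding-window-causal patterns (`True` = may attend), written as an illustration rather than the kernel's actual representation:

```python
def causal_mask(n):
    """Token i may attend to tokens j <= i (no looking ahead)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_causal_mask(n, window):
    """Causal, but limited to the most recent `window` tokens."""
    return [[i - window < j <= i for j in range(n)] for i in range(n)]

assert causal_mask(3) == [
    [True, False, False],
    [True, True, False],
    [True, True, True],
]
assert sliding_window_causal_mask(3, 2) == [
    [True, False, False],
    [True, True, False],
    [False, True, True],  # token 2 can no longer see token 0
]
```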
## `PositionalEncodingVariant` {#max.nn.legacy.attention.mask_config.PositionalEncodingVariant}
> class max.nn.legacy.attention.mask\_config.PositionalEncodingVariant(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
### `ALIBI_POS` {#max.nn.legacy.attention.mask_config.PositionalEncodingVariant.ALIBI_POS}
> ALIBI\_POS = 'alibi\_pos'
### `NO_POS` {#max.nn.legacy.attention.mask_config.PositionalEncodingVariant.NO_POS}
> NO\_POS = 'no\_pos'
---
## multi_latent_attention
An attention mechanism with RoPE, optimized for the opaque KV cache.
## `DataParallelLatentAttentionWithRope` {#max.nn.legacy.attention.multi_latent_attention.DataParallelLatentAttentionWithRope}
> class max.nn.legacy.attention.multi\_latent\_attention.DataParallelLatentAttentionWithRope(\*\*kwargs)
Data-parallel implementation of Latent Attention with RoPE.
This replicates the attention module across devices and runs each replica on
its local inputs (x, kv, freqs\_cis, input\_row\_offsets). No collective ops
are required; KV-cache remains local to each device.
**Notes:**
* signal\_buffers is accepted for interface parity with the distributed
implementation but is not used here.
* Assumes the caller has already distributed xs, kv\_collections,
freqs\_cis, and input\_row\_offsets so that index i corresponds to
device i, with input\_row\_offsets\[i] rebased to start at 0.
### `create_mla_prefill_metadata()` {#max.nn.legacy.attention.multi_latent_attention.DataParallelLatentAttentionWithRope.create_mla_prefill_metadata}
> create\_mla\_prefill\_metadata(input\_row\_offsets\_, kv\_collections)
## `LatentAttentionWithRope` {#max.nn.legacy.attention.multi_latent_attention.LatentAttentionWithRope}
> class max.nn.legacy.attention.multi\_latent\_attention.LatentAttentionWithRope(\*, rope, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, dtype, devices=None, linear\_cls=Linear, o\_proj\_dtype=None, o\_proj\_float8\_config=None, scale=None, q\_lora\_rank=None, kv\_lora\_rank=512, qk\_nope\_head\_dim=128, qk\_rope\_head\_dim=64, v\_head\_dim=128, buffer\_size=16384, graph\_mode=None)
Implementation of Latent Attention with Rope.
**Parameters:**
* rope ([RotaryEmbedding](../rotary_embedding.md#max.nn.legacy.rotary_embedding.RotaryEmbedding)) – The rope layer to borrow the freqs\_cis value from.
* num\_attention\_heads ([int](https://docs.python.org/3/library/functions.html#int)) – The number of attention heads.
* num\_key\_value\_heads ([int](https://docs.python.org/3/library/functions.html#int)) – Number of key/value heads.
* hidden\_size ([int](https://docs.python.org/3/library/functions.html#int)) – The dimension of the hidden states.
* kv\_params ([KVCacheParams](../kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KV Cache Params, including the number of kv heads, the
head dim, and data type.
* dtype ([DType](../../../dtype.md#max.dtype.DType)) – DType of the weights, currently only bfloat16 is supported.
* devices ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[DeviceRef](../../../graph/ops.md#max.graph.ops.DeviceRef)] | None) – Device to place the weights and run the computation. If
multiple are provided, the first device is used.
* linear\_cls ([Callable](../../../graph/ops.md#max.graph.ops.Callable)\[..., [Linear](../../Linear.md#max.nn.Linear)]) – Linear class to use for the outputs dense layer.
* o\_proj\_dtype ([DType](../../../dtype.md#max.dtype.DType) | None) – Optional dtype override for the output projection.
* o\_proj\_float8\_config ([Float8Config](../float8_config.md#max.nn.legacy.float8_config.Float8Config) | None) – Optional float8 config for the output projection.
* scale ([float](https://docs.python.org/3/library/functions.html#float) | None) – Value used to scale the results of the attention output.
* q\_lora\_rank ([int](https://docs.python.org/3/library/functions.html#int) | None) – Optional LoRA rank for Q projection.
* kv\_lora\_rank ([int](https://docs.python.org/3/library/functions.html#int)) – LoRA rank for KV projections.
* qk\_nope\_head\_dim ([int](https://docs.python.org/3/library/functions.html#int)) – Head dimension for non-positional encoding part.
* qk\_rope\_head\_dim ([int](https://docs.python.org/3/library/functions.html#int)) – Head dimension for rope part.
* v\_head\_dim ([int](https://docs.python.org/3/library/functions.html#int)) – Head dimension for value.
* buffer\_size ([int](https://docs.python.org/3/library/functions.html#int)) – Buffer size for storing temporary results during
prefill, in units of tokens.
* graph\_mode ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Pipeline role to use for the attention layer. Should be
“prefill”, “decode”, or “auto”.
### `rope` {#max.nn.legacy.attention.multi_latent_attention.LatentAttentionWithRope.rope}
> rope: [RotaryEmbedding](../rotary_embedding.md#max.nn.legacy.rotary_embedding.RotaryEmbedding)
### `shard()` {#max.nn.legacy.attention.multi_latent_attention.LatentAttentionWithRope.shard}
> shard(devices)
Creates sharded views of this Module across multiple devices.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](../../../graph/type.md#max.graph.type.DeviceRef)]) – Iterable of devices to place the shards on.
**Returns:**
List of sharded LatentAttentionWithRope instances, one for each device.
### `sharding_strategy` {#max.nn.legacy.attention.multi_latent_attention.LatentAttentionWithRope.sharding_strategy}
> property sharding\_strategy: ShardingStrategy | [None](https://docs.python.org/3/library/constants.html#None)
Get the Module sharding strategy.
### `w_uk_uv` {#max.nn.legacy.attention.multi_latent_attention.LatentAttentionWithRope.w_uk_uv}
> property w\_uk\_uv: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)]
The UK and UV weight matrices of the latent attention decomposition.
## `MLAPrefillMetadata` {#max.nn.legacy.attention.multi_latent_attention.MLAPrefillMetadata}
> class max.nn.legacy.attention.multi\_latent\_attention.MLAPrefillMetadata(buffer\_row\_offsets, cache\_offsets, buffer\_lengths)
Dataclass to hold MLA prefill metadata.
### `buffer_lengths` {#max.nn.legacy.attention.multi_latent_attention.MLAPrefillMetadata.buffer_lengths}
> buffer\_lengths: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)
### `buffer_row_offsets` {#max.nn.legacy.attention.multi_latent_attention.MLAPrefillMetadata.buffer_row_offsets}
> buffer\_row\_offsets: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)
### `cache_offsets` {#max.nn.legacy.attention.multi_latent_attention.MLAPrefillMetadata.cache_offsets}
> cache\_offsets: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)
## `TensorParallelLatentAttentionWithRope` {#max.nn.legacy.attention.multi_latent_attention.TensorParallelLatentAttentionWithRope}
> class max.nn.legacy.attention.multi\_latent\_attention.TensorParallelLatentAttentionWithRope(\*\*kwargs)
Distributed tensor-parallel implementation of Latent Attention with
RoPE. Note that using tensor parallelism for MLA duplicates the KV-cache
across all devices, which is not efficient.
### `create_mla_prefill_metadata()` {#max.nn.legacy.attention.multi_latent_attention.TensorParallelLatentAttentionWithRope.create_mla_prefill_metadata}
> create\_mla\_prefill\_metadata(input\_row\_offsets\_, kv\_collections)
---
## multihead_attention
## `MultiheadAttention` {#max.nn.legacy.attention.multihead_attention.MultiheadAttention}
> class max.nn.legacy.attention.multihead\_attention.MultiheadAttention(num\_attention\_heads, hidden\_size, devices=None, dtype=float32, scale=None, qkv\_has\_bias=False, o\_proj\_has\_bias=False, stacked\_qkv=False)
Multihead attention that handles both single and distributed computation.
### `wqkv` {#max.nn.legacy.attention.multihead_attention.MultiheadAttention.wqkv}
> property wqkv: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)
The concatenation of q, k, and v weight vectors.
### `wqkv_bias` {#max.nn.legacy.attention.multihead_attention.MultiheadAttention.wqkv_bias}
> property wqkv\_bias: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue) | [None](https://docs.python.org/3/library/constants.html#None)
The concatenation of q, k, and v bias weight vectors.
---
## ragged_attention
A vanilla attention mechanism optimized for the opaque KV cache, with mask variants provided inside the kernel.
## `RaggedAttention` {#max.nn.legacy.attention.ragged_attention.RaggedAttention}
> class max.nn.legacy.attention.ragged\_attention.RaggedAttention(\*, mask\_variant, num\_attention\_heads, num\_key\_value\_heads, hidden\_size, kv\_params, devices=None, dtype=float32, linear\_cls=Linear, stacked\_qkv=False, scale=None, has\_bias=False, clip\_qkv=None)
Layer that computes the self attention score for ragged inputs.
### `wqkv` {#max.nn.legacy.attention.ragged_attention.RaggedAttention.wqkv}
> property wqkv: [TensorValue](../../../graph/TensorValue.md#max.graph.TensorValue)
The concatenation of q, k, and v weight vectors.
---
## clamp
## `clamp()` {#max.nn.legacy.clamp.clamp}
> max.nn.legacy.clamp.clamp(x, min=None, max=None)
Clamps values in `x` to `[min, max]`.
**Parameters:**
* x ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Input tensor to clamp.
* min ([float](https://docs.python.org/3/library/functions.html#float) | None) – Minimum value. If `None`, no lower bound is applied.
* max ([float](https://docs.python.org/3/library/functions.html#float) | None) – Maximum value. If `None`, no upper bound is applied.
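The elementwise semantics can be sketched with the standard library alone. This illustration renames the bounds `lo`/`hi` (the API uses `min`/`max`) so Python's builtins stay available:

```python
def clamp(x, lo=None, hi=None):
    """Clamp each element of x to [lo, hi]; None disables that bound."""
    def clamp_one(v):
        if lo is not None:
            v = max(v, lo)  # apply the lower bound
        if hi is not None:
            v = min(v, hi)  # apply the upper bound
        return v
    return [clamp_one(v) for v in x]

assert clamp([-2.0, 0.5, 3.0], lo=0.0, hi=1.0) == [0.0, 0.5, 1.0]
assert clamp([-2.0, 0.5, 3.0], hi=1.0) == [-2.0, 0.5, 1.0]  # no lower bound
```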
---
## comm
## `Allreduce` {#max.nn.legacy.comm.Allreduce}
> class max.nn.legacy.comm.Allreduce(num\_accelerators)
Layer to perform allreduce operation with automatic implementation selection.
Automatically chooses between peer-to-peer optimized allreduce and naive
device-to-device transfer based on accelerator connectivity.
**Parameters:**
num\_accelerators ([int](https://docs.python.org/3/library/functions.html#int)) – Number of accelerators participating in the allreduce operation.
### `devices` {#max.nn.legacy.comm.Allreduce.devices}
> devices: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Accelerator](../../../driver.md#max.driver.Accelerator)]
List of accelerators involved in the allreduce operation.
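Functionally, allreduce leaves every participant with the elementwise sum of all inputs. A naive single-process sketch of that contract (hypothetical data, no device transfers):

```python
def naive_allreduce(per_device_tensors):
    """Sum corresponding elements across devices, then give every device
    a copy of the result (the 'reduce then broadcast' view of allreduce)."""
    reduced = [sum(vals) for vals in zip(*per_device_tensors)]
    return [list(reduced) for _ in per_device_tensors]

inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three accelerators
assert naive_allreduce(inputs) == [[9.0, 12.0]] * 3
```

Real implementations (peer-to-peer or ring allreduce) avoid materializing the full sum on one device, but the result is the same.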
## `Signals` {#max.nn.legacy.comm.Signals}
> class max.nn.legacy.comm.Signals(devices)
Signal buffers used for peer-to-peer communication in allreduce.
Device code uses these buffers by enabling peer-to-peer access.
Then thread blocks use the buffers to implement barriers for
synchronization, and to hold intermediate communication results.
### `NUM_BYTES` {#max.nn.legacy.comm.Signals.NUM_BYTES}
> NUM\_BYTES = 537919488
The size of the signal buffers used for communication in allreduce.
### `buffers()` {#max.nn.legacy.comm.Signals.buffers}
> buffers()
Allocates and returns buffers used for communication in allreduce.
Synchronizes so that buffers are ready for use when this method
returns.
### `devices` {#max.nn.legacy.comm.Signals.devices}
> devices: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[DeviceRef](../../../graph/type.md#max.graph.type.DeviceRef)]
List of graph devices that these signals communicate between.
### `input_types()` {#max.nn.legacy.comm.Signals.input_types}
> input\_types()
Gets graph input types corresponding to these signal buffers.
---
## conv
## `Conv1D` {#max.nn.legacy.conv.Conv1D}
### `bias` {#max.nn.legacy.conv.Conv1D.bias}
> bias: [Weight](../../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None) = None
The optional bias vector stored on CPU with shape (out\_channels,).
Model init moves the bias to [`device`](#max.nn.legacy.conv.Conv1D.device) if present.
### `device` {#max.nn.legacy.conv.Conv1D.device}
> device: [DeviceRef](../../graph/ops.md#max.graph.ops.DeviceRef) | [None](https://docs.python.org/3/library/constants.html#None)
The device where matrix operations are performed.
### `dilation` {#max.nn.legacy.conv.Conv1D.dilation}
> dilation: [int](https://docs.python.org/3/library/functions.html#int)
Controls the dilation rate.
### `filter` {#max.nn.legacy.conv.Conv1D.filter}
> filter: [Weight](../../graph/Weight.md#max.graph.Weight)
The weight matrix stored on CPU with shape (kernel\_size, in\_channels / num\_groups, out\_channels).
Model init moves the weight to [`device`](#max.nn.legacy.conv.Conv1D.device).
### `num_groups` {#max.nn.legacy.conv.Conv1D.num_groups}
> num\_groups: [int](https://docs.python.org/3/library/functions.html#int)
Number of blocked connections from input channels to output channels.
### `padding` {#max.nn.legacy.conv.Conv1D.padding}
> padding: [int](https://docs.python.org/3/library/functions.html#int) | [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Controls the amount of padding applied to the input.
If int: symmetric padding applied to both sides (pad\_left = pad\_right = padding).
If tuple\[int, int]: asymmetric padding as (pad\_left, pad\_right).
### `permute` {#max.nn.legacy.conv.Conv1D.permute}
> permute: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether `self.filter` is permuted from the PyTorch order to the MAX order.
PyTorch order: (out\_channels, in\_channels / num\_groups, kernel\_size).
MAX API order: (kernel\_size, in\_channels / num\_groups, out\_channels).
### `stride` {#max.nn.legacy.conv.Conv1D.stride}
> stride: [int](https://docs.python.org/3/library/functions.html#int)
Controls the stride for the cross-correlation.
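Padding, dilation, and stride together determine the output length via the standard cross-correlation formula. A small helper illustrating the arithmetic (not part of the API), using the same int-or-tuple padding convention as `Conv1D` above:

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution.

    `padding` may be an int (symmetric) or a (pad_left, pad_right) tuple.
    """
    if isinstance(padding, int):
        pad_total = 2 * padding          # symmetric: both sides padded equally
    else:
        pad_total = padding[0] + padding[1]
    # Dilation spreads the kernel taps apart, enlarging its receptive field.
    effective_kernel = dilation * (kernel_size - 1) + 1
    return (length + pad_total - effective_kernel) // stride + 1

assert conv1d_out_len(10, 3) == 8              # no padding shrinks the output
assert conv1d_out_len(10, 3, padding=1) == 10  # 'same' length at stride 1
```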
## `Conv2d` {#max.nn.legacy.conv.Conv2d}
> class max.nn.legacy.conv.Conv2d(kernel\_size, in\_channels, out\_channels, dtype, stride=1, padding=0, dilation=1, num\_groups=1, device=None, has\_bias=False, permute=False, name=None)
A 2D convolution over an input signal composed of several input
planes.
**Example:**
```python
from max.dtype import DType
from max.graph import DeviceRef
import max.nn.legacy as nn

conv = nn.Conv2d(
kernel_size=3,
in_channels=64,
out_channels=128,
dtype=DType.float32,
stride=1,
padding=0,
has_bias=False,
name="conv2d_weight",
device=DeviceRef.GPU(),
)
```
### `bias` {#max.nn.legacy.conv.Conv2d.bias}
> bias: [Weight](../../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None) = None
The optional bias vector stored on CPU with shape (out\_channels,).
Model init moves the bias to [`device`](#max.nn.legacy.conv.Conv2d.device) if present.
### `device` {#max.nn.legacy.conv.Conv2d.device}
> device: [DeviceRef](../../graph/ops.md#max.graph.ops.DeviceRef) | [None](https://docs.python.org/3/library/constants.html#None)
The device where matrix operations are performed.
### `dilation` {#max.nn.legacy.conv.Conv2d.dilation}
> dilation: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Controls the dilation rate.
### `filter` {#max.nn.legacy.conv.Conv2d.filter}
> filter: [Weight](../../graph/Weight.md#max.graph.Weight)
The weight matrix stored on CPU with shape (height, width, in\_channels / num\_groups, out\_channels).
Model init moves the weight to [`device`](#max.nn.legacy.conv.Conv2d.device).
### `num_groups` {#max.nn.legacy.conv.Conv2d.num_groups}
> num\_groups: [int](https://docs.python.org/3/library/functions.html#int)
Number of blocked connections from input channels to output channels.
### `padding` {#max.nn.legacy.conv.Conv2d.padding}
> padding: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Controls the amount of padding applied before and after the input for height and width dimensions.
Format: (pad\_top, pad\_bottom, pad\_left, pad\_right).
### `permute` {#max.nn.legacy.conv.Conv2d.permute}
> permute: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether `self.filter` is permuted from the PyTorch order to the MAX order.
PyTorch order: (out\_channels, in\_channels / num\_groups, height, width).
MAX API order: (height, width, in\_channels / num\_groups, out\_channels).
### `shard()` {#max.nn.legacy.conv.Conv2d.shard}
> shard(devices)
Creates sharded views of this Conv2d layer across multiple devices.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](../../graph/type.md#max.graph.type.DeviceRef)]) – Iterable of devices to place the shards on.
**Returns:**
List of sharded Conv2d instances, one for each device.
## `Conv3D` {#max.nn.legacy.conv.Conv3D}
### `bias` {#max.nn.legacy.conv.Conv3D.bias}
> bias: [Weight](../../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None) = None
The optional bias vector stored on CPU with shape (out\_channels,).
Model init moves the bias to [`device`](#max.nn.legacy.conv.Conv3D.device) if present.
### `device` {#max.nn.legacy.conv.Conv3D.device}
> device: [DeviceRef](../../graph/ops.md#max.graph.ops.DeviceRef) | [None](https://docs.python.org/3/library/constants.html#None)
The device where matrix operations are performed.
### `dilation` {#max.nn.legacy.conv.Conv3D.dilation}
> dilation: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Controls the dilation rate for depth, height, and width dimensions.
### `filter` {#max.nn.legacy.conv.Conv3D.filter}
> filter: [Weight](../../graph/Weight.md#max.graph.Weight)
The weight matrix stored on CPU with shape (depth, height, width, in\_channels / num\_groups, out\_channels).
Model init moves the weight to [`device`](#max.nn.legacy.conv.Conv3D.device).
### `num_groups` {#max.nn.legacy.conv.Conv3D.num_groups}
> num\_groups: [int](https://docs.python.org/3/library/functions.html#int)
Number of blocked connections from input channels to output channels.
### `padding` {#max.nn.legacy.conv.Conv3D.padding}
> padding: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Controls the amount of padding applied before and after the input for depth, height, and width dimensions.
Format: (pad\_front, pad\_back, pad\_top, pad\_bottom, pad\_left, pad\_right).
### `permute` {#max.nn.legacy.conv.Conv3D.permute}
> permute: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether `self.filter` is permuted from the PyTorch order to the MAX order.
PyTorch order: (out\_channels, in\_channels / num\_groups, depth, height, width).
MAX API order: (depth, height, width, in\_channels / num\_groups, out\_channels).
### `stride` {#max.nn.legacy.conv.Conv3D.stride}
> stride: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Controls the stride for the cross-correlation.
---
## conv_transpose
## `ConvTranspose1d` {#max.nn.legacy.conv_transpose.ConvTranspose1d}
> class max.nn.legacy.conv\_transpose.ConvTranspose1d(length, in\_channels, out\_channels, dtype, stride=1, padding=0, dilation=1, output\_padding=0, device=None, has\_bias=False, permute=False, name=None)
A 1D transposed convolution operator over an input image composed of several input planes.
```python
conv = nn.ConvTranspose1d(
    length=kernel_size,
    in_channels=in_channels,
    out_channels=out_channels,
    dtype=dtype,
    stride=stride,
    padding=padding,
    output_padding=output_padding,
    has_bias=False,
    name="conv_transpose1d_weight",
    device=DeviceRef.GPU(),
)
```
### `bias` {#max.nn.legacy.conv_transpose.ConvTranspose1d.bias}
> bias: [Weight](../../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None) = None
The optional bias vector stored on CPU with shape (out\_channels,).
Model init moves the bias to [`device`](#max.nn.legacy.conv_transpose.ConvTranspose1d.device) if present.
### `device` {#max.nn.legacy.conv_transpose.ConvTranspose1d.device}
> device: [DeviceRef](../../graph/type.md#max.graph.type.DeviceRef) | [None](https://docs.python.org/3/library/constants.html#None)
The device where matrix operations are performed.
### `dilation` {#max.nn.legacy.conv_transpose.ConvTranspose1d.dilation}
> dilation: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Not implemented yet; dilation is assumed to be 1.
### `output_padding` {#max.nn.legacy.conv_transpose.ConvTranspose1d.output_padding}
> output\_padding: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Additional size added to one side of the output shape. Default: 0.
### `padding` {#max.nn.legacy.conv_transpose.ConvTranspose1d.padding}
> padding: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Controls the amount of padding applied before and after the input.
### `permute` {#max.nn.legacy.conv_transpose.ConvTranspose1d.permute}
> permute: [bool](https://docs.python.org/3/library/functions.html#bool)
Controls whether `self.weight` is permuted from PyTorch order to MAX order.
PyTorch order is: (in\_channels, out\_channels, kernel\_length)
Max API order: (kernel\_length, out\_channels, in\_channels).
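As with the other convolution layers, this permutation is a simple transpose (example shapes only):

```python
import numpy as np

w_torch = np.zeros((4, 8, 5))             # (in_channels, out_channels, kernel_length)
w_max = np.transpose(w_torch, (2, 1, 0))  # -> (kernel_length, out_channels, in_channels)
print(w_max.shape)  # (5, 8, 4)
```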
### `stride` {#max.nn.legacy.conv_transpose.ConvTranspose1d.stride}
> stride: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Controls the stride for the cross-correlation.
### `weight` {#max.nn.legacy.conv_transpose.ConvTranspose1d.weight}
> weight: [Weight](../../graph/Weight.md#max.graph.Weight)
The weight matrix stored on CPU with shape (kernel\_length, out\_channels, in\_channels).
Model init moves the weight to [`device`](#max.nn.legacy.conv_transpose.ConvTranspose1d.device).
## `WeightNormConvTranspose1d` {#max.nn.legacy.conv_transpose.WeightNormConvTranspose1d}
> class max.nn.legacy.conv\_transpose.WeightNormConvTranspose1d(length, in\_channels, out\_channels, dtype, stride=1, padding=0, dilation=1, output\_padding=0, device=None, has\_bias=False, permute=False, name=None)
A 1D transposed convolution operator over an input image composed of several input planes.
This version uses weight normalization (Salimans & Kingma, 2016).
Weight normalization reparameterizes weights in terms of a direction vector `v` and a magnitude scalar `g`.
This can help improve optimization by decoupling the length and direction of weight vectors.
For example:
```python
conv = WeightNormConvTranspose1d(
    length=kernel_size,
    in_channels=in_channels,
    out_channels=out_channels,
    dtype=dtype,
    stride=stride,
    padding=padding,
    output_padding=output_padding,
    has_bias=False,
    device=DeviceRef.GPU(),
)
```
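The reparameterization itself can be sketched in NumPy. This is a minimal illustration; the layer may apply the norm over different reduction axes than the full-tensor norm used here:

```python
import numpy as np

v = np.arange(1.0, 7.0).reshape(2, 3)  # direction parameter (weight_v)
g = 2.0                                # magnitude parameter (weight_g)

# Effective weight: v rescaled to unit norm, then scaled by g,
# so the norm of w equals g regardless of the length of v.
w = g * v / np.linalg.norm(v)
print(np.isclose(np.linalg.norm(w), g))  # True
```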
### `conv` {#max.nn.legacy.conv_transpose.WeightNormConvTranspose1d.conv}
> conv: [ConvTranspose1d](#max.nn.legacy.conv_transpose.ConvTranspose1d)
The underlying ConvTranspose1d layer.
### `device` {#max.nn.legacy.conv_transpose.WeightNormConvTranspose1d.device}
> device: [DeviceRef](../../graph/type.md#max.graph.type.DeviceRef) | [None](https://docs.python.org/3/library/constants.html#None)
The device where matrix operations are performed.
### `weight_g` {#max.nn.legacy.conv_transpose.WeightNormConvTranspose1d.weight_g}
> weight_g: [Weight](../../graph/Weight.md#max.graph.Weight)
The magnitude parameter g for weight normalization.
### `weight_v` {#max.nn.legacy.conv_transpose.WeightNormConvTranspose1d.weight_v}
> weight_v: [Weight](../../graph/Weight.md#max.graph.Weight)
The direction parameter v for weight normalization.
---
## embedding (Legacy)
The `embedding` module provides classes for mapping integer indices (like
token IDs) to dense vector representations. These embedding operations are
fundamental building blocks for natural language processing, recommendation
systems, and other tasks involving discrete tokens.
* `Embedding`: Basic embedding lookup table for simple use cases
* `EmbeddingV2`: Enhanced embedding with device placement control and improved memory management
* `VocabParallelEmbedding`: Distributed embedding that shards the vocabulary across multiple devices for large embedding tables
Here’s an example demonstrating how to use embeddings:
```python
import max.nn as nn
from max.graph import Graph, ops, DeviceRef
from max.dtype import DType
import numpy as np
with Graph(name="embedding_example") as graph:
# Define dimensions
batch_size = 4
seq_length = 16
vocab_size = 10000
hidden_dim = 256
# Create input tensor of token indices
input_data = np.random.randint(0, vocab_size, (batch_size, seq_length), dtype=np.int32)
input_indices = ops.constant(input_data, dtype=DType.int32, device=DeviceRef.CPU())
# Create embedding layer
embedding = nn.EmbeddingV2(
vocab_size=vocab_size,
hidden_dim=hidden_dim,
dtype=DType.float32,
device=DeviceRef.GPU(),
name="token_embeddings"
)
# Look up embeddings for input indices
embeddings = embedding(input_indices)
print(f"Embedding output shape: {embeddings.shape}")
# Embedding output shape: [Dim(4), Dim(16), Dim(256)]
```
## `Embedding` {#max.nn.legacy.embedding.Embedding}
> class max.nn.legacy.embedding.Embedding(vocab\_size, hidden\_dim, dtype, device, quantization\_encoding=None, name=None)
A lookup table for embedding integer indices into dense vectors.
This layer maps each integer index to a dense vector of fixed size.
Embedding weights are stored on the CPU but are moved to the specified
device during the model init phase.
Example:
```python
embedding_layer = Embedding(
vocab_size=1000,
hidden_dim=256,
dtype=DType.float32,
device=DeviceRef.GPU(),
name="embeddings",
)
token_indices: TensorValueLike
embeddings = embedding_layer(token_indices)
```
### `device` {#max.nn.legacy.embedding.Embedding.device}
> device: [DeviceRef](../../graph/ops.md#max.graph.ops.DeviceRef)
The device on which embedding lookup is performed.
### `weight` {#max.nn.legacy.embedding.Embedding.weight}
> weight: [Weight](../../graph/Weight.md#max.graph.Weight)
The embedding weight matrix stored on the CPU.
Model init moves weights to the device specified in [`device`](#max.nn.legacy.embedding.Embedding.device).
## `VocabParallelEmbedding` {#max.nn.legacy.embedding.VocabParallelEmbedding}
> class max.nn.legacy.embedding.VocabParallelEmbedding(vocab\_size, hidden\_dim, dtype, devices, quantization\_encoding=None, name=None)
A lookup table for embedding integer indices into dense vectors.
This layer works like nn.Embedding except the embedding table is sharded
on the vocabulary dimension across all devices.
Example:
```python
embedding_layer = VocabParallelEmbedding(
vocab_size=1000,
hidden_dim=256,
dtype=DType.float32,
    devices=[DeviceRef.GPU(0), DeviceRef.GPU(1)],
name="embeddings",
)
# Token indices of shape: [batch, ..., num_indices].
token_indices: TensorValueLike
embeddings = embedding_layer(token_indices)
```
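One common way to shard a vocabulary across devices is contiguous blocks, where a global token id maps to a device and a row within that device's shard. This is an illustrative sketch only; the actual shard layout used by `VocabParallelEmbedding` may differ:

```python
vocab_size, num_devices = 1000, 2
shard_size = vocab_size // num_devices  # assumes vocab_size divides evenly

def locate(token_id: int) -> tuple[int, int]:
    """Map a global token id to (device index, row within that device's shard)."""
    return token_id // shard_size, token_id % shard_size

print(locate(42))   # (0, 42)  -> first shard
print(locate(734))  # (1, 234) -> second shard
```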
---
## float8_config
Float8 configuration data structures for models.
## `Float8Config` {#max.nn.legacy.float8_config.Float8Config}
> class max.nn.legacy.float8\_config.Float8Config(input\_scale, weight\_scale, mlp\_in\_float8, attn\_qkv\_in\_float8, embedding\_output\_dtype=None, bias\_dtype=None, quant\_method=None, quant\_algo=None)
Configures float8 quantization settings for a layer or model section.
### `attn_qkv_in_float8` {#max.nn.legacy.float8_config.Float8Config.attn_qkv_in_float8}
> attn\_qkv\_in\_float8: [set](https://docs.python.org/3/library/stdtypes.html#set)\[[int](https://docs.python.org/3/library/functions.html#int)]
Set of layer indices with attention QKV projections in float8.
QKV projections are either all quantized or all unquantized per layer:
either all of {q,k,v,o}\_proj are float8, or all are bfloat16.
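The per-layer all-or-nothing semantics amount to a set-membership check (a sketch with hypothetical layer indices):

```python
# Hypothetical layer indices quantized to float8; all other layers stay bfloat16.
attn_qkv_in_float8 = {0, 1, 2}

def qkv_dtype(layer_idx: int) -> str:
    # Per-layer all-or-nothing: every {q,k,v,o}_proj in the layer shares one dtype.
    return "float8" if layer_idx in attn_qkv_in_float8 else "bfloat16"

print(qkv_dtype(1))   # float8
print(qkv_dtype(10))  # bfloat16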
### `bias_dtype` {#max.nn.legacy.float8_config.Float8Config.bias_dtype}
> bias\_dtype: [DType](../../dtype.md#max.dtype.DType) | [None](https://docs.python.org/3/library/constants.html#None) = None
The `DType` of bias weights.
### `embedding_output_dtype` {#max.nn.legacy.float8_config.Float8Config.embedding_output_dtype}
> embedding\_output\_dtype: [DType](../../dtype.md#max.dtype.DType) | [None](https://docs.python.org/3/library/constants.html#None) = None
The `DType` of the output from the embedding layer.
### `input_scale` {#max.nn.legacy.float8_config.Float8Config.input_scale}
> input\_scale: [Float8InputScaleSpec](#max.nn.legacy.float8_config.Float8InputScaleSpec)
[`Float8InputScaleSpec`](#max.nn.legacy.float8_config.Float8InputScaleSpec) for input activation scaling.
### `is_dynamic` {#max.nn.legacy.float8_config.Float8Config.is_dynamic}
> property is\_dynamic: [bool](https://docs.python.org/3/library/functions.html#bool)
Returns `True` if the input scale is dynamic.
### `is_nvfp4` {#max.nn.legacy.float8_config.Float8Config.is_nvfp4}
> property is\_nvfp4: [bool](https://docs.python.org/3/library/functions.html#bool)
Returns `True` if this config represents modelopt NVFP4.
### `is_static` {#max.nn.legacy.float8_config.Float8Config.is_static}
> property is\_static: [bool](https://docs.python.org/3/library/functions.html#bool)
Returns `True` if the input scale is static.
### `mlp_in_float8` {#max.nn.legacy.float8_config.Float8Config.mlp_in_float8}
> mlp\_in\_float8: [set](https://docs.python.org/3/library/stdtypes.html#set)\[[int](https://docs.python.org/3/library/functions.html#int)]
Set of layer indices with MLPs in float8.
MLPs are either all quantized or all unquantized per layer:
either all of gate\_proj, down\_proj, and up\_proj are float8, or all are bfloat16.
### `quant_algo` {#max.nn.legacy.float8_config.Float8Config.quant_algo}
> quant\_algo: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None) = None
Additional differentiator within the same `quant_method`, e.g. modelopt NVFP4 vs. FP8.
### `quant_method` {#max.nn.legacy.float8_config.Float8Config.quant_method}
> quant\_method: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None) = None
The quantization method used (e.g., “fbgemm\_fp8”).
### `quantized_scales_type()` {#max.nn.legacy.float8_config.Float8Config.quantized_scales_type}
> quantized\_scales\_type(quantized\_shape, device\_ref)
Returns the TensorType of the scales tensor after dynamic quantization.
### `scales_granularity_mnk` {#max.nn.legacy.float8_config.Float8Config.scales_granularity_mnk}
> property scales\_granularity\_mnk: [tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]
Returns the weight and input scale granularities on M, N and K axis.
### `weight_scale` {#max.nn.legacy.float8_config.Float8Config.weight_scale}
> weight\_scale: [Float8WeightScaleSpec](#max.nn.legacy.float8_config.Float8WeightScaleSpec)
[`Float8WeightScaleSpec`](#max.nn.legacy.float8_config.Float8WeightScaleSpec) for weight scaling.
## `Float8InputScaleSpec` {#max.nn.legacy.float8_config.Float8InputScaleSpec}
> class max.nn.legacy.float8\_config.Float8InputScaleSpec(granularity, origin, dtype, activation\_scale\_ub=None, block\_size=None)
Specifies how input activations are scaled for float8 quantization.
---
## hooks
## `PrintHook` {#max.nn.legacy.hooks.PrintHook}
> class max.nn.legacy.hooks.PrintHook(export\_path=None, filter=None)
Hook that prints/saves layer tensor inputs and outputs.
This class must be initialized before the graph is built so that the
print ops can be added to the graph.
### `export_path` {#max.nn.legacy.hooks.PrintHook.export_path}
> property export\_path: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None)
### `name_layers()` {#max.nn.legacy.hooks.PrintHook.name_layers}
> name\_layers(model)
Create names for all layers in the model based on nested attributes.
**Parameters:**
model ([Layer](layer.md#max.nn.legacy.layer.Layer))
**Return type:**
None
### `print_value()` {#max.nn.legacy.hooks.PrintHook.print_value}
> print\_value(name, value)
Prints a value, and returns whether the print is successful.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str))
* value ([Any](https://docs.python.org/3/library/typing.html#typing.Any))
---
## legacy
Legacy graph-based neural network API.
:::note Note
This is the legacy API for backward compatibility. For all new models, use
the eager tensor API from [nn](/max/api/python/nn).
:::
The legacy API provides graph-based layer implementations for building neural
networks. This API was the primary interface prior to MAX 26.1 and remains
available for backward compatibility.
**Using the Legacy API:**
```python
from max.nn.legacy import Module, Linear, LayerNorm
from max.nn.legacy.attention import AttentionWithRope
```
## Modules
* [`attention`](/max/api/python/nn/legacy/attention): Attention mechanisms for sequence modeling.
* [`clamp`](/max/api/python/nn/legacy/clamp): Value clamping utilities for tensor operations.
* [`comm`](/max/api/python/nn/legacy/comm): Communication primitives for distributed training.
* [`conv`](/max/api/python/nn/legacy/conv): Convolutional layers for spatial processing.
* [`conv_transpose`](/max/api/python/nn/legacy/conv_transpose): Transposed convolution for upsampling.
* [`embedding`](/max/api/python/nn/legacy/embedding): Embedding layers with vocabulary support.
* [`float8_config`](/max/api/python/nn/legacy/float8_config): Configuration for FP8 quantization.
* [`hooks`](/max/api/python/nn/legacy/hooks): Extension hooks for layer customization.
* [`kernels`](/max/api/python/nn/legacy/kernels): Custom kernel implementations.
* [`kv_cache`](/max/api/python/nn/legacy/kv_cache): Key-value cache for efficient generation.
* [`layer`](/max/api/python/nn/legacy/layer): Base classes for building graph-based layers.
* [`linear`](/max/api/python/nn/legacy/linear): Linear transformation layers with optional parallelism.
* [`lora`](/max/api/python/nn/legacy/lora): Low-Rank Adaptation for efficient fine-tuning.
* [`moe`](/max/api/python/nn/legacy/moe): Mixture of Experts layer implementations.
* [`norm`](/max/api/python/nn/legacy/norm): Normalization layers for training stability.
* [`rotary_embedding`](/max/api/python/nn/legacy/rotary_embedding): Rotary position embeddings for sequences.
* [`sampling`](/max/api/python/nn/legacy/sampling): Sampling strategies for generation.
* [`sequential`](/max/api/python/nn/legacy/sequential): Container for sequential layer composition.
* [`transformer`](/max/api/python/nn/legacy/transformer): Transformer building blocks and layers.
---
## kernels
Helper functions for wrapping custom kv cache/attention related ops.
## `apply_penalties_to_logits()` {#max.nn.legacy.kernels.apply_penalties_to_logits}
> max.nn.legacy.kernels.apply\_penalties\_to\_logits(logits\_buffer, frequency\_data, frequency\_offsets, \*, frequency\_penalty=0.0, presence\_penalty=0.0, repetition\_penalty=1.0)
Applies penalties to the logits.
**Parameters:**
* logits\_buffer ([BufferValue](../../graph/BufferValue.md#max.graph.BufferValue)) – The buffer to apply penalties to.
* frequency\_data ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – 2d tensor of shape \[unique\_tokens, 2], where
the first column indicates the token id and the second column
indicates the frequency of the token.
* frequency\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – 1d tensor of shape \[batch\_size + 1], indicating
start of each sequence’s data.
* frequency\_penalty (Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/shape.md#max.graph.shape.Shape) | [Dim](../../graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../../driver.md#max.driver.DLPackArray)) – The frequency penalty to apply to the model’s output.
A positive value will penalize new tokens based on their frequency
in the generated text: tokens will receive a penalty proportional
to the count of appearances.
* presence\_penalty (Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/shape.md#max.graph.shape.Shape) | [Dim](../../graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../../driver.md#max.driver.DLPackArray)) – The presence penalty to apply to the model’s output
A positive value will penalize new tokens that have already appeared
in the generated text at least once by applying a constant penalty.
* repetition\_penalty (Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/shape.md#max.graph.shape.Shape) | [Dim](../../graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../../driver.md#max.driver.DLPackArray)) – The repetition penalty to apply to the model’s
output. Values > 1 will penalize new tokens that have already
appeared in prompt and generated text at least once by dividing the
logits by the repetition penalty.
**Return type:**
None
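The three penalties described above can be sketched in NumPy. This is an illustrative approximation of the semantics, not the kernel itself; the kernel's exact formula (e.g. sign handling for the repetition penalty) may differ:

```python
import numpy as np

def penalize(logits, counts, freq_pen=0.0, pres_pen=0.0, rep_pen=1.0):
    """Apply frequency, presence, and repetition penalties to a logit vector,
    given per-token appearance counts."""
    logits = logits.copy()
    seen = counts > 0
    logits -= freq_pen * counts  # proportional to appearance count
    logits -= pres_pen * seen    # flat penalty for any appearance
    logits[seen] /= rep_pen      # divide seen tokens' logits
    return logits

logits = np.array([2.0, 1.0, -1.0])
counts = np.array([3, 0, 1])
print(penalize(logits, counts, freq_pen=0.5, pres_pen=0.2, rep_pen=2.0))
```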
## `batched_dynamic_scaled_fp8_matmul()` {#max.nn.legacy.kernels.batched_dynamic_scaled_fp8_matmul}
> max.nn.legacy.kernels.batched\_dynamic\_scaled\_fp8\_matmul(a, b, a\_scales, b\_scales, input\_scale\_spec, weight\_scale\_spec, out\_type=bfloat16)
Perform a batched blockwise scaled matmul of two tensors with scaling factors.
**Parameters:**
* a ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The first tensor to multiply (3D tensor).
* b ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The second tensor to multiply, must be transposed (3D tensor).
* a\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scaling factors for the first tensor (3D tensor).
* b\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scaling factors for the second tensor (3D tensor).
* input\_scale\_spec ([Float8InputScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8InputScaleSpec))
* weight\_scale\_spec ([Float8WeightScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8WeightScaleSpec))
* out\_type ([DType](../../dtype.md#max.dtype.DType))
## `batched_quantize_dynamic_scaled_float8()` {#max.nn.legacy.kernels.batched_quantize_dynamic_scaled_float8}
> max.nn.legacy.kernels.batched\_quantize\_dynamic\_scaled\_float8(input, input\_scale\_spec, weight\_scale\_spec, scale\_ub=1200.0, group\_size\_or\_per\_token=-1, out\_type=float8\_e4m3fn, scales\_type=bfloat16)
Dynamically quantize the input tensor to fp8.
**Parameters:**
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The input tensor to quantize. Shape: [batch\_size, seq\_len, hidden\_size]
* scale\_ub ([float](https://docs.python.org/3/library/functions.html#float)) – The upper bound of the scale factor.
* group\_size\_or\_per\_token ([int](https://docs.python.org/3/library/functions.html#int)) – The group size for quantization. When set to -1,
the quantization is column-wise.
* out\_type ([DType](../../dtype.md#max.dtype.DType)) – The type of the output tensor.
* scales\_type ([DType](../../dtype.md#max.dtype.DType)) – The type of the scales tensor.
* input\_scale\_spec ([Float8InputScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8InputScaleSpec))
* weight\_scale\_spec ([Float8WeightScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8WeightScaleSpec))
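The per-token dynamic-scaling case can be sketched as follows. This is a simplified illustration (the `FP8_E4M3_MAX` constant and clipping are assumptions about the scheme; the kernel also supports group-wise and column-wise modes and emits a separate scales tensor):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value

def quantize_per_token(x, scale_ub=1200.0):
    """Compute a per-token scale from the row-wise absmax (capped at scale_ub)
    and rescale values into the representable fp8 range."""
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.minimum(amax, scale_ub) / FP8_E4M3_MAX
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale

x = np.array([[100.0, -300.0, 50.0]])
q, scale = quantize_per_token(x)
print(q.round(1), scale)
```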
## `block_scales_interleave()` {#max.nn.legacy.kernels.block_scales_interleave}
> max.nn.legacy.kernels.block\_scales\_interleave(scales, sf\_vector\_size=16, scales\_type=float8\_e4m3fn)
Interleave the block scales tensor in \[M, N] layout to \[ceildiv(M, 128), ceildiv(N, sf\_vector\_size \* 4), 32, 4, 4] layout.
**Parameters:**
* scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scales tensor to interleave in \[M, N] layout.
* sf\_vector\_size ([int](https://docs.python.org/3/library/functions.html#int)) – The block size for the scaling factors.
* scales\_type ([DType](../../dtype.md#max.dtype.DType))
**Returns:**
The interleaved scales tensor in \[ceildiv(M, 128), ceildiv(N, sf\_vector\_size \* 4), 32, 4, 4] layout.
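The output shape follows directly from the stated layout; a small helper makes the arithmetic concrete:

```python
from math import ceil

def interleaved_shape(m: int, n: int, sf_vector_size: int = 16) -> tuple:
    # [M, N] -> [ceildiv(M, 128), ceildiv(N, sf_vector_size * 4), 32, 4, 4]
    return (ceil(m / 128), ceil(n / (sf_vector_size * 4)), 32, 4, 4)

print(interleaved_shape(256, 64))   # (2, 1, 32, 4, 4)
print(interleaved_shape(130, 100))  # (2, 2, 32, 4, 4)
```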
## `convert_weights_to_fp8_fnuz_if_needed()` {#max.nn.legacy.kernels.convert_weights_to_fp8_fnuz_if_needed}
> max.nn.legacy.kernels.convert\_weights\_to\_fp8\_fnuz\_if\_needed(weight, weight\_scale)
Convert weights and scales to FP8 FNUZ format if needed for AMD GPUs.
This utility function checks whether FP8 FNUZ conversion is needed (currently
only on AMD MI300 GPUs) and performs the conversion if required. This
centralizes conversion logic that was previously duplicated across multiple files.
**Parameters:**
* weight ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The weight tensor to potentially convert.
* weight\_scale ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The weight scale factor.
**Returns:**
Tuple of (weight, weight\_scale) - converted if needed, original otherwise.
## `cross_attention_ragged()` {#max.nn.legacy.kernels.cross_attention_ragged}
> max.nn.legacy.kernels.cross\_attention\_ragged(kv\_params, input, input\_row\_offsets, kv\_collection, layer\_idx, mask\_variant, kv\_input\_row\_offsets, q\_max\_seq\_len, scale, local\_window\_size=-1)
Computes cross attention provided the !mo.opaque KV Cache.
Notably, this materializes the attention mask (dependent on MHAMaskVariant)
within the kernel.
input and input\_row\_offsets are used together to implement the ragged
tensor: input\_row\_offsets indicates where each batch starts and ends in
input, while kv\_input\_row\_offsets represents the KV sequence length.
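The ragged input/input\_row\_offsets convention used throughout these kernels can be sketched as follows (hidden\_dim=8 is an arbitrary example value):

```python
import numpy as np

# Three sequences of lengths 3, 4, and 2 packed into one ragged buffer.
input = np.zeros((9, 8))                    # [total_seq_len, hidden_dim]
input_row_offsets = np.array([0, 3, 7, 9])  # [batch_size + 1]

# Each sequence b occupies rows input_row_offsets[b] : input_row_offsets[b + 1].
lengths = [
    int(input_row_offsets[b + 1] - input_row_offsets[b])
    for b in range(len(input_row_offsets) - 1)
]
print(lengths)  # [3, 4, 2]
```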
## `dynamic_block_scaled_matmul_fp4()` {#max.nn.legacy.kernels.dynamic_block_scaled_matmul_fp4}
> max.nn.legacy.kernels.dynamic\_block\_scaled\_matmul\_fp4(a, b, a\_scales, b\_scales, tensor\_sf, sf\_vector\_size=16, out\_type=bfloat16)
Perform a matmul of two FP4 tensors with 1D-block scaled scaling factors.
**Parameters:**
* a ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The first tensor to multiply.
* b ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The second tensor to multiply, must be transposed.
* a\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scaling factors for the first tensor.
* b\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scaling factors for the second tensor.
* tensor\_sf ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [float](https://docs.python.org/3/library/functions.html#float)) – Buffer-wise scaling factor equal to weight\_scale\_2 \* input\_scale (non-inverted).
* sf\_vector\_size ([int](https://docs.python.org/3/library/functions.html#int))
* out\_type ([DType](../../dtype.md#max.dtype.DType))
## `dynamic_scaled_matmul()` {#max.nn.legacy.kernels.dynamic_scaled_matmul}
> max.nn.legacy.kernels.dynamic\_scaled\_matmul(a, b, a\_scales, b\_scales, input\_scale\_spec, weight\_scale\_spec, out\_type=bfloat16)
Perform a matmul of two tensors with scaling factors. Currently only
supports channel-wise scaling for weights and per-token scaling for inputs.
**Parameters:**
* a ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The first tensor to multiply.
* b ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The second tensor to multiply, must be transposed.
* a\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scaling factors for the first tensor.
* b\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scaling factors for the second tensor.
* input\_scale\_spec ([Float8InputScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8InputScaleSpec))
* weight\_scale\_spec ([Float8WeightScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8WeightScaleSpec))
* out\_type ([DType](../../dtype.md#max.dtype.DType))
## `flare_mla_decode_ragged()` {#max.nn.legacy.kernels.flare_mla_decode_ragged}
> max.nn.legacy.kernels.flare\_mla\_decode\_ragged(kv\_params, input, input\_row\_offsets, kv\_collection, layer\_idx, mask\_variant, scale, qk\_rope\_dim=64)
Computes flash (self) attention provided the !mo.opaque KV Cache.
Notably, this materializes the attention mask (dependent on MHAMaskVariant)
within the kernel.
input and input\_row\_offsets are used together to implement the ragged
tensor: input\_row\_offsets indicates where each batch starts and ends in
input.
Note that this is self attention and the KV sequence length is
assumed to be equal to the Q sequence length.
For KV sequence length != Q sequence length, use cross\_attention\_ragged.
## `flare_mla_decompress_k_cache()` {#max.nn.legacy.kernels.flare_mla_decompress_k_cache}
> max.nn.legacy.kernels.flare\_mla\_decompress\_k\_cache(kv\_params, buffer\_row\_offsets\_1d, cache\_offsets\_1d, buffer\_length, weight, kv\_collection, layer\_idx, buffer\_size)
This kernel decompresses the key cache by up-projecting latent representations
into the KV space using a weight matrix.
The process involves:
1. Copying buffer\_length latent vectors from the key cache into a contiguous
buffer (k\_latent)
2. Computing k = k\_latent @ weight.T to obtain the decompressed keys
**Returns:**
A tensor of shape \[buffer\_size, weight.shape\[0]] containing the decompressed
keys. Note that only the first buffer\_length tokens are valid.
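Step 2 is an ordinary up-projection matmul, shown here with illustrative sizes (the real latent and KV dims come from the model config):

```python
import numpy as np

buffer_length, latent_dim, kv_dim = 6, 16, 32
k_latent = np.zeros((buffer_length, latent_dim))  # latent vectors copied from the key cache
weight = np.zeros((kv_dim, latent_dim))           # up-projection weight

k = k_latent @ weight.T  # decompressed keys: [buffer_length, weight.shape[0]]
print(k.shape)  # (6, 32)
```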
## `flare_mla_prefill_plan()` {#max.nn.legacy.kernels.flare_mla_prefill_plan}
> max.nn.legacy.kernels.flare\_mla\_prefill\_plan(kv\_params, input\_row\_offsets, kv\_collection, layer\_idx, buffer\_size, max\_chunks=16)
This kernel plans how to process a batch of sequences with
varying lengths using a fixed-size buffer.
Each sequence in the batch has some existing cached tokens and new input
tokens. The kernel divides the total tokens into chunks of buffer\_size.
For each chunk (iteration), it calculates:
1. Buffer offsets for each sequence in each chunk
2. Cache offsets for each sequence in each chunk
3. Total buffer lengths for each processing iteration
## `flare_mla_prefill_ragged()` {#max.nn.legacy.kernels.flare_mla_prefill_ragged}
> max.nn.legacy.kernels.flare\_mla\_prefill\_ragged(kv\_params, input, k, v, input\_row\_offsets, buffer\_row\_offsets, cache\_offsets, kv\_collection, layer\_idx, mask\_variant, scale, qk\_rope\_dim=64)
Performs MLA prefill. In the MLA prefill, we need to decompress
the KV tensors, as we store the latent representations in the KV cache.
We will decompress the KV tensors into a fixed size buffer to avoid
out-of-memory errors. In case the total cache length is greater than
the buffer size, we will process the attention calculation in chunks.
This MLA prefill kernel will return the output tensor for this iteration
and the softmax info tensor for this iteration. Such tensors will be used
by the next iteration of the MLA prefill kernel to continue the attention
calculation.
**Parameters:**
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KVCacheParams
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Input tensor
* k ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Key tensor
* v ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Value tensor
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Indicates where each batch starts and ends in input
* buffer\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Indicates where each batch starts and ends in the buffer
* cache\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Indicates where each batch starts and ends in the KV cache
* kv\_collection (PagedCacheValues) – KV collection
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Layer index tensor
* mask\_variant ([MHAMaskVariant](attention/mask_config.md#max.nn.legacy.attention.mask_config.MHAMaskVariant)) – Mask variant
* scale ([float](https://docs.python.org/3/library/functions.html#float)) – Scale
* qk\_rope\_dim ([int](https://docs.python.org/3/library/functions.html#int)) – QK rope dimension
## `flash_attention_ragged()` {#max.nn.legacy.kernels.flash_attention_ragged}
> max.nn.legacy.kernels.flash\_attention\_ragged(kv\_params, input, input\_row\_offsets, kv\_collection, layer\_idx, mask\_variant, scale, local\_window\_size=-1, sink\_weights=None)
Computes flash (self) attention given the `!mo.opaque` KV cache.
Notably, this materializes the attention mask (dependent on MHAMaskVariant)
within the kernel.
input and input\_row\_offsets are used together to implement the ragged
tensor.
input\_row\_offsets indicates where each batch starts and ends in input.
Note that this is self attention and the KV sequence length is
assumed to be equal to the Q sequence length.
For KV sequence length != Q sequence length, use cross\_attention\_ragged.
**Parameters:**
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KVCacheParams object containing key-value cache parameters.
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – TensorValue representing the input tensor with shape \[total\_seq\_len, hidden\_dim].
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – TensorValue indicating the start and end of each batch in the input tensor with shape \[batch\_size + 1].
* kv\_collection (PagedCacheValues) – PagedCacheValues object for managing key-value cache.
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – TensorValue representing the layer index, expected to have dtype uint32.
* mask\_variant ([MHAMaskVariant](attention/mask_config.md#max.nn.legacy.attention.mask_config.MHAMaskVariant)) – MHAMaskVariant specifying the type of attention mask to use.
* scale ([float](https://docs.python.org/3/library/functions.html#float)) – float value used to scale the attention scores.
* local\_window\_size ([int](https://docs.python.org/3/library/functions.html#int)) – int specifying the size of the local attention window, default is -1 for no local window.
* sink\_weights ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | None) – Optional tensor of shape \[num\_heads] containing learnable sink weights for each attention head.
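For intuition, the ragged causal self-attention semantics described above can be sketched in plain NumPy: one softmax attention per row-offset slice, with future positions masked. This is an illustrative reference, not the kernel; the function name is ours, and sink weights and local windows are omitted.

```python
import numpy as np

def ragged_causal_attention(q, k, v, row_offsets, scale):
    """Reference semantics for ragged causal self-attention.

    q, k, v: [total_seq_len, num_heads, head_dim]; row_offsets: [batch+1].
    Each sequence attends only within its own row-offset slice.
    """
    out = np.empty_like(q)
    for b in range(len(row_offsets) - 1):
        s, e = row_offsets[b], row_offsets[b + 1]
        qs, ks, vs = q[s:e], k[s:e], v[s:e]               # [L, H, D]
        scores = np.einsum("qhd,khd->hqk", qs, ks) * scale
        L = e - s
        future = np.triu(np.ones((L, L), dtype=bool), k=1)  # causal mask
        scores = np.where(future[None, :, :], -np.inf, scores)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[s:e] = np.einsum("hqk,khd->qhd", weights, vs)
    return out
```

With the causal mask, the first token of each sequence can only attend to itself, so its output equals its own value vector.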
## `flash_attention_ragged_gpu()` {#max.nn.legacy.kernels.flash_attention_ragged_gpu}
> max.nn.legacy.kernels.flash\_attention\_ragged\_gpu(q, k, v, input\_row\_offsets, max\_seq\_len, mask\_variant, scale, local\_window\_size=-1)
Computes flash attention for ragged inputs using GPU-optimized kernel
without a KV cache.
**Parameters:**
* q ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Query tensor of shape \[total\_seq\_len, num\_heads, head\_dim] (ragged)
* k ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Key tensor of shape \[total\_seq\_len, num\_heads, head\_dim] (ragged)
* v ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Value tensor of shape \[total\_seq\_len, num\_heads, head\_dim] (ragged)
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Buffer of shape \[batch\_size + 1] with dtype uint32.
Indicates where each sequence starts and ends in the ragged tensors.
The values should be a prefix sum (cumulative sum) of sequence lengths.
* mask\_variant ([MHAMaskVariant](attention/mask_config.md#max.nn.legacy.attention.mask_config.MHAMaskVariant)) – The mask variant to use for attention
* scale ([float](https://docs.python.org/3/library/functions.html#float)) – Scaling factor for attention scores
* local\_window\_size ([int](https://docs.python.org/3/library/functions.html#int)) – Local window size for sliding window attention
* max\_seq\_len ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue))
**Returns:**
Output tensor of shape \[total\_seq\_len, num\_heads, head\_dim]
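The prefix-sum convention for `input_row_offsets` can be illustrated with a short NumPy sketch (the `seq_lens` values here are hypothetical):

```python
import numpy as np

# Hypothetical per-sequence lengths for a batch of 3 ragged sequences.
seq_lens = np.array([3, 1, 4], dtype=np.uint32)

# input_row_offsets is an exclusive prefix sum: offsets[i] is where
# sequence i starts in the packed [total_seq_len, ...] tensor, and
# offsets[-1] equals the total token count.
input_row_offsets = np.concatenate(([0], np.cumsum(seq_lens))).astype(np.uint32)

print(input_row_offsets)  # [0 3 4 8]
```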
## `fused_qk_padded_rope()` {#max.nn.legacy.kernels.fused_qk_padded_rope}
> max.nn.legacy.kernels.fused\_qk\_padded\_rope(kv\_params, input, kv\_collection, freqs\_cis, layer\_idx, valid\_lengths, interleaved=True)
Computes fused query-key RoPE with padded inputs and paged KV cache.
This function applies Rotary Positional Embeddings (RoPE) to both Q and K tensors,
where K is stored in the paged KV cache. This is the padded equivalent of
fused\_qk\_ragged\_rope.
**Parameters:**
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KV cache parameters.
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Query tensor of shape \[batch, seq\_len, n\_heads, head\_dim].
* kv\_collection (PagedCacheValues) – Paged KV cache collection.
* freqs\_cis ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Frequency tensor of shape (max\_seq\_len \* 2, head\_dim).
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Layer index for KV cache (must be uint32 on CPU).
* valid\_lengths ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Buffer of shape \[batch] containing the valid length for each
sequence (must be uint32). RoPE is only applied to positions within
these lengths.
* interleaved ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to use interleaved RoPE pattern.
**Returns:**
Query tensor with RoPE applied, same shape as input.
:::note Note
Unlike fused\_qk\_ragged\_rope which requires ragged inputs, this function
works with padded batch inputs where sequences may have different actual
lengths but are padded to a uniform shape.
:::
## `fused_qk_ragged_rope()` {#max.nn.legacy.kernels.fused_qk_ragged_rope}
> max.nn.legacy.kernels.fused\_qk\_ragged\_rope(kv\_params, input, input\_row\_offsets, kv\_collection, freqs\_cis, layer\_idx, interleaved=True, position\_ids=None, mrope\_section=None)
Computes fused query-key attention with rotary positional encodings and ragged inputs.
**Parameters:**
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KV cache parameters
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – \[batch\_size \* seq\_len, n\_heads, head\_dim]
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Ragged tensor offsets indicating where each batch starts and ends
* kv\_collection (PagedCacheValues) – KV cache collection
* freqs\_cis ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – tensor of shape (max\_seq\_len \* 2, head\_dim)
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Layer index for KV cache
* interleaved ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to use interleaved RoPE pattern
* position\_ids ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | None) – Optional ragged 2D array of position IDs. If None, defaults to
cache\_length + token\_idx for each token. When num\_sections > 1,
mrope\_section must be provided to indicate each section of the head\_dim
to apply RoPE to. Shape: [num\_sections, total\_seq\_len]
* mrope\_section ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)] | None) – Optional list of integers indicating the section of the head\_dim to
apply RoPE to. Must be used in conjunction with position\_ids.
input and input\_row\_offsets are used together to implement the ragged tensor.
input\_row\_offsets indicates where each batch starts and ends in input. If input
is not of the same dtype as freqs\_cis, it will be cast to the dtype of freqs\_cis
for the computation, and cast back to the original dtype after the computation is
finished.
When position\_ids and mrope\_section are provided, it replaces the default position
calculation (cache\_length + token\_idx) with explicit position values. This is useful for
3D RoPE in models like Qwen2.5-VL that need custom position encoding.
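As an illustration, the default position calculation (`cache_length + token_idx`) can be sketched as follows; the helper name and the NumPy loop are ours, not part of the API:

```python
import numpy as np

def default_position_ids(input_row_offsets, cache_lengths):
    """Sketch of the default per-token positions (cache_length + token_idx).

    input_row_offsets: [batch+1] prefix sums of new-token counts.
    cache_lengths: [batch] tokens already present in the KV cache.
    """
    positions = []
    for b in range(len(input_row_offsets) - 1):
        n_new = input_row_offsets[b + 1] - input_row_offsets[b]
        # New tokens continue from where the cached sequence left off.
        positions.append(cache_lengths[b] + np.arange(n_new))
    return np.concatenate(positions)
```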
## `fused_qkv_padded_matmul()` {#max.nn.legacy.kernels.fused_qkv_padded_matmul}
> max.nn.legacy.kernels.fused\_qkv\_padded\_matmul(kv\_params, input, wqkv, kv\_collection, layer\_idx, valid\_lengths, n\_heads)
Computes fused query, key, and value projections with padded input.
This is for non-ragged (padded batch) inputs where sequences may have
different actual lengths but are padded to a uniform shape.
**Parameters:**
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KV cache parameters.
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Input tensor with shape \[batch\_size, seq\_len, hidden\_dim].
* wqkv ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Weight tensor for Q, K, V projections.
* kv\_collection (PagedCacheValues) – Paged KV cache collection.
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Layer index for cache lookup (must be uint32).
* valid\_lengths ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Buffer of shape \[batch] containing the valid length for each
sequence (must be uint32). K and V are only written to cache for
positions within these lengths.
* n\_heads ([int](https://docs.python.org/3/library/functions.html#int)) – Number of attention heads.
**Returns:**
Query projections tensor. K and V projections are written to cache.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.
## `fused_qkv_ragged_matmul()` {#max.nn.legacy.kernels.fused_qkv_ragged_matmul}
> max.nn.legacy.kernels.fused\_qkv\_ragged\_matmul(kv\_params, input, input\_row\_offsets, wqkv, kv\_collection, layer\_idx, n\_heads, bias=None)
Computes fused query, key, and value projections with ragged input.
input and input\_row\_offsets are used together to implement the ragged
tensor.
input\_row\_offsets indicates where each batch starts and ends in input.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.
## `fused_qkv_ragged_matmul_quantized()` {#max.nn.legacy.kernels.fused_qkv_ragged_matmul_quantized}
> max.nn.legacy.kernels.fused\_qkv\_ragged\_matmul\_quantized(kv\_params, input, input\_row\_offsets, wqkv, kv\_collection, layer\_idx, n\_heads, quantization\_config, perm\_idx=None, bias=None)
Computes fused query, key, and value projections with ragged input and
quantized weight matrices. A quantization\_config must be provided.
input and input\_row\_offsets are used together to implement the ragged
tensor.
input\_row\_offsets indicates where each batch starts and ends in input.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.
## `grouped_dynamic_scaled_fp8_matmul()` {#max.nn.legacy.kernels.grouped_dynamic_scaled_fp8_matmul}
> max.nn.legacy.kernels.grouped\_dynamic\_scaled\_fp8\_matmul(hidden\_states, weight, a\_scales, b\_scales, expert\_start\_indices, expert\_ids, expert\_usage\_stats\_host, input\_scale\_spec, weight\_scale\_spec, out\_type=bfloat16, tokens\_padded\_per\_expert=False)
Performs a grouped blockwise scaled matmul of two tensors with scaling
factors, as used in MoE layers.
hidden\_states and expert\_start\_indices are used together to implement
the ragged tensor.
**Parameters:**
* hidden\_states ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The first tensor to multiply. (2D tensor)
* weight ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The second tensor to multiply, must be transposed. (3D tensor)
* a\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scaling factors for the first tensor. (2D tensor)
* b\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scaling factors for the second tensor. (3D tensor)
* expert\_start\_indices ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – indicates where each group starts and ends in hidden\_states.
* expert\_ids ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The id of the expert for each group in hidden\_states.
* expert\_usage\_stats\_host ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The maximum number of tokens assigned to any expert, and the number of active experts.
* input\_scale\_spec ([Float8InputScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8InputScaleSpec)) – The scaling granularity for the input tensor.
* weight\_scale\_spec ([Float8WeightScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8WeightScaleSpec)) – The scaling granularity for the weight tensor.
* tokens\_padded\_per\_expert ([bool](https://docs.python.org/3/library/functions.html#bool)) – If True, the number of tokens for each local expert is guaranteed to be
padded so that a\_scales is aligned to 16 bytes. This is needed by the optimized grouped matmul kernel.
* out\_type ([DType](../../dtype.md#max.dtype.DType))
## `grouped_dynamic_scaled_nvfp4_matmul()` {#max.nn.legacy.kernels.grouped_dynamic_scaled_nvfp4_matmul}
> max.nn.legacy.kernels.grouped\_dynamic\_scaled\_nvfp4\_matmul(hidden\_states, weight, a\_scales, b\_scales, expert\_start\_indices, a\_scale\_offsets, expert\_ids, expert\_scales, expert\_usage\_stats\_host, out\_type=bfloat16)
Performs grouped NVFP4 matmul for MoE layers.
Performs a grouped matmul with NVFP4 (4-bit) quantized inputs and weights.
The inputs are packed as uint8 (2 NVFP4 values per byte) with float8\_e4m3fn
scaling factors. NVFP4 uses fixed 1D block scaling with 16 elements per
scale factor along the K dimension.
`hidden_states` and `expert_start_indices` together implement the ragged
tensor representation for variable-length expert inputs.
**Parameters:**
* hidden\_states ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The input activations with shape `[total_tokens, K/2]`
where K is the unpacked hidden dimension. Dtype must be uint8
(packed NVFP4).
* weight ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The expert weights with shape `[num_experts, N, K/2]`.
Dtype must be uint8 (packed NVFP4).
* a\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Scaling factors for inputs with shape
`[num_scale_rows, K_groups, 32, 4, 4]`. Dtype must be float8\_e4m3fn.
* b\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Scaling factors for weights with shape
`[num_experts, N_groups, K_groups, 32, 4, 4]`. Dtype must be
float8\_e4m3fn.
* expert\_start\_indices ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Indices indicating where each expert’s tokens
start in `hidden_states`.
* a\_scale\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The offsets of the input scale tiles for each expert.
* expert\_ids ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The expert ID for each group.
* expert\_scales ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Per-expert scaling factors with shape `[num_experts]`.
Dtype must be float32. Multiplied with the matmul output in the
epilogue.
* expert\_usage\_stats\_host ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – A tensor containing \[max\_tokens\_per\_expert,
num\_active\_experts].
* out\_type ([DType](../../dtype.md#max.dtype.DType)) – Output dtype. Defaults to bfloat16.
* tokens\_padded\_per\_expert – If True, tokens per expert are padded for
alignment. Defaults to False.
**Returns:**
The matmul result with shape `[total_tokens, N]` and dtype `out_type`.
## `grouped_matmul_ragged()` {#max.nn.legacy.kernels.grouped_matmul_ragged}
> max.nn.legacy.kernels.grouped\_matmul\_ragged(hidden\_states, weight, expert\_start\_indices, expert\_ids, expert\_usage\_stats\_host)
Grouped matmul used in MoE layer.
hidden\_states and expert\_start\_indices are used together to implement
the ragged tensor. expert\_start\_indices indicates where each group starts
and ends in hidden\_states.
expert\_ids is the id of the expert for each group in hidden\_states.
expert\_usage\_stats\_host is the maximum number of tokens assigned to any
expert, and the number of active experts.
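A minimal NumPy sketch of these semantics (an illustrative reference, not the kernel; it assumes `weight` is laid out as `[num_experts, N, K]` and applied transposed, consistent with the grouped matmul variants above):

```python
import numpy as np

def grouped_matmul_ragged_ref(hidden_states, weight,
                              expert_start_indices, expert_ids):
    """NumPy sketch of grouped-matmul semantics.

    hidden_states: [total_tokens, K], packed so tokens for each expert
    group are contiguous. weight: [num_experts, N, K]. Each group g is
    multiplied by the transposed weight of its assigned expert.
    """
    out = np.empty((hidden_states.shape[0], weight.shape[1]),
                   dtype=hidden_states.dtype)
    for g in range(len(expert_start_indices) - 1):
        s, e = expert_start_indices[g], expert_start_indices[g + 1]
        out[s:e] = hidden_states[s:e] @ weight[expert_ids[g]].T
    return out
```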
## `kv_cache_copy_pages_d2h()` {#max.nn.legacy.kernels.kv_cache_copy_pages_d2h}
> max.nn.legacy.kernels.kv\_cache\_copy\_pages\_d2h(device\_kv\_collection, device\_page\_ids, host\_kv\_blocks, host\_page\_ids, layer\_idx, device\_ref)
Copy KV cache pages from GPU to CPU for a single layer.
Performs async GPU->CPU copy of specified pages for layer-wise KV cache
offloading.
**Parameters:**
* device\_kv\_collection (PagedCacheValues) – Source KV cache on GPU.
* device\_page\_ids ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Source page IDs to read from GPU.
* host\_kv\_collection – Destination KV cache on CPU.
* host\_page\_ids ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Destination page IDs to write to CPU.
Must have same length as device\_page\_ids.
* layer\_idx ([int](https://docs.python.org/3/library/functions.html#int)) – Which layer to copy.
* device\_ref ([DeviceRef](../../graph/type.md#max.graph.type.DeviceRef)) – Device for the GPU context.
* host\_kv\_blocks ([BufferValue](../../graph/BufferValue.md#max.graph.BufferValue))
**Return type:**
None
## `kv_cache_get_max_seq_len()` {#max.nn.legacy.kernels.kv_cache_get_max_seq_len}
> max.nn.legacy.kernels.kv\_cache\_get\_max\_seq\_len(kv\_params, kv\_collection)
Returns the maximum sequence length currently stored in the KV cache.
## `kv_cache_ragged_2m_iadd()` {#max.nn.legacy.kernels.kv_cache_ragged_2m_iadd}
> max.nn.legacy.kernels.kv\_cache\_ragged\_2m\_iadd(kv\_params, a, kv\_collection, input\_row\_offsets, lora\_end\_idx, batch\_seq\_len, layer\_idx)
In-place add to paged KV cache with interleaved K/V layout.
Performs an in-place addition of new key-value projections to paged KV cache.
The input tensor a uses a “2M” layout where keys and values are interleaved:
rows \[0, m) contain keys and rows \[m, 2m) contain values, where m is the number
of tokens.
**Parameters:**
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KV cache configuration parameters. Must have cache\_strategy
set to PAGED and page\_size must be defined.
* a ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Input tensor with interleaved K/V data, shape (2\*m, hidden\_size) where
m is the number of tokens. Rows \[0, m) are keys, rows \[m, 2m) are values.
* kv\_collection (PagedCacheValues) – The paged KV cache collection containing cache blocks,
cache lengths, lookup tables, and max lengths tensors.
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Ragged tensor offsets indicating where each batch starts and ends
* lora\_end\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – End index of LoRA token portion. Marks the boundary between
LoRA sequences and base model sequences in the batch.
* batch\_seq\_len ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Total sequence length in the batch. Used for indexing
into the value portion of a.
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The transformer layer index to update in the KV cache.
**Raises:**
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If a does not have rank 2.
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If input\_row\_offsets does not have rank 1.
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If kv\_params.cache\_strategy is not PAGED.
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If kv\_params.page\_size is None.
**Return type:**
None
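The "2M" layout can be constructed with a simple concatenation; the following is an illustrative sketch with placeholder projections, not API code:

```python
import numpy as np

# Sketch of the "2M" interleaved layout expected by
# kv_cache_ragged_2m_iadd: for m tokens, rows [0, m) hold the key
# projections and rows [m, 2m) hold the value projections.
m, hidden_size = 4, 8
k_proj = np.ones((m, hidden_size))        # placeholder key projections
v_proj = np.full((m, hidden_size), 2.0)   # placeholder value projections

a = np.concatenate([k_proj, v_proj], axis=0)  # shape (2*m, hidden_size)
assert a.shape == (2 * m, hidden_size)
```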
## `kv_cache_ragged_radd()` {#max.nn.legacy.kernels.kv_cache_ragged_radd}
> max.nn.legacy.kernels.kv\_cache\_ragged\_radd(kv\_params, a, kv\_collection, input\_row\_offsets, batch\_offset, layer\_idx)
This function adds a tensor to a slice of the KVCache, sliced on the batch dimension.
This expects the requests being sliced out to be contiguous at the front of
the tensor, so the addition applies only to the last requests in the batch.
**Parameters:**
* a ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The tensor to add to the KVCache.
* kv\_collection (PagedCacheValues) – The KVCache collection to add to.
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The offsets of the input tensor.
* batch\_offset ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The batch to start applying the r-add to.
* layer\_idx ([int](https://docs.python.org/3/library/functions.html#int)) – The layer index to add to.
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams))
**Return type:**
None
## `matmul_k_cache_ragged()` {#max.nn.legacy.kernels.matmul_k_cache_ragged}
> max.nn.legacy.kernels.matmul\_k\_cache\_ragged(kv\_params, hidden\_states, input\_row\_offsets, weight, kv\_collection, layer\_idx)
Computes key projections with ragged input.
hidden\_states and input\_row\_offsets are used together to
implement the ragged tensor.
input\_row\_offsets indicates where each batch starts and ends in input.
**Parameters:**
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KVCacheParams object containing key-value cache parameters.
* hidden\_states ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – TensorValue representing the input tensor with shape
\[M=total\_seq\_len, K=hidden\_dim].
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – TensorValue indicating the start and end of each
batch in the input tensor with shape \[batch\_size + 1].
* weight ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – TensorValue representing the weight tensor with shape
\[N=num\_heads, K=hidden\_dim].
* input\_scale ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – TensorValue representing the input scale tensor with shape
\[ceildiv(K / BLOCK\_SIZE\_K), ceildiv(M / BLOCK\_SIZE\_M)].
* weight\_scale ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – TensorValue representing the weight scale tensor with
shape \[ceildiv(N / BLOCK\_SIZE\_N), ceildiv(K / BLOCK\_SIZE\_K)].
* kv\_collection (PagedCacheValues) – PagedCacheValues object for managing key-value cache.
* scales\_granularity\_mnk ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int), [int](https://docs.python.org/3/library/functions.html#int)]) – tuple\[int, int, int] representing the
scaling (BLOCK\_SIZE\_M, BLOCK\_SIZE\_N, BLOCK\_SIZE\_K).
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – TensorValue representing the layer index, expected to have
dtype uint32.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel,
or when the cache strategy is not supported.
**Return type:**
None
## `matmul_kv_cache_ragged()` {#max.nn.legacy.kernels.matmul_kv_cache_ragged}
> max.nn.legacy.kernels.matmul\_kv\_cache\_ragged(kv\_params, hidden\_states, input\_row\_offsets, weight, kv\_collection, layer\_idx)
Computes key and value projections with ragged input.
hidden\_states and input\_row\_offsets are used together to
implement the ragged tensor.
input\_row\_offsets indicates where each batch starts and ends in input.
## `merge_ragged_tensors()` {#max.nn.legacy.kernels.merge_ragged_tensors}
> max.nn.legacy.kernels.merge\_ragged\_tensors(a, a\_row\_offsets, b, b\_row\_offsets)
Merges two ragged tensors into a single ragged tensor.
Both ragged tensors must have the same batch size (same number of row
offsets). This function interleaves the rows from each tensor based on
their row offsets.
**Parameters:**
* a ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The first ragged tensor of shape \[total\_a\_rows, …].
* a\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The row offsets of the first ragged tensor, indicating
where each batch starts and ends in a.
* b ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The second ragged tensor of shape \[total\_b\_rows, …].
* b\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The row offsets of the second ragged tensor, indicating
where each batch starts and ends in b.
**Returns:**
* The merged ragged tensor with shape
\[total\_a\_rows + total\_b\_rows, …].
* The merged row offsets with the same shape as input row offsets.
**Return type:**
A tuple of two tensors
Example:
```python
a = [1, 2, 3, 4, 5, 6]
a_row_offsets = [0, 2, 6]
b = [7, 8, 9, 10]
b_row_offsets = [0, 3, 4]
merged_tensor, merged_row_offsets = merge_ragged_tensors(
a, a_row_offsets, b, b_row_offsets)
merged_tensor = [1, 2, 7, 8, 9, 3, 4, 5, 6, 10]
merged_row_offsets = [0, 5, 10]
```
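The interleaving above can be expressed as a short pure-Python reference on flat lists (an illustrative sketch of the semantics, not the kernel):

```python
def merge_ragged_ref(a, a_offsets, b, b_offsets):
    """Pure-Python sketch of merge_ragged_tensors' semantics.

    For each batch i, the rows of a are followed by the rows of b,
    and the merged offsets track the cumulative row count.
    """
    merged, offsets = [], [0]
    for i in range(len(a_offsets) - 1):
        merged.extend(a[a_offsets[i]:a_offsets[i + 1]])
        merged.extend(b[b_offsets[i]:b_offsets[i + 1]])
        offsets.append(len(merged))
    return merged, offsets
```

Running this on the example above reproduces the documented result.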
## `mla_decode_branch_fp8()` {#max.nn.legacy.kernels.mla_decode_branch_fp8}
> max.nn.legacy.kernels.mla\_decode\_branch\_fp8(q, input\_row\_offsets, freqs\_cis, kv\_a\_proj\_layernorm, w\_uk, w\_uk\_scale, w\_uv, w\_uv\_scale, kv\_params, kv\_collection, layer\_idx, mask\_variant, scale, epsilon, v\_head\_dim, float8\_config)
This is a manually fused kernel that performs the following operations:
* Apply RoPE to the query and the key cache (in-place).
* Apply RMSNorm to the non-rope portion of the key cache (in-place).
* Project q\_nope to kv\_latent\_dim through an fp8 batched matmul:
q\_nope\_proj = q\_nope\_t @ w\_uk
* Concatenate q\_nope\_proj and q\_rope:
q\_full = concat(q\_nope\_proj, q\_rope, axis=2)
* Perform MLA decode
* Project raw\_output to v\_head\_dim through another fp8 batched matmul:
output = raw\_output\_t @ w\_uv
**Parameters:**
* q ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Combined query tensor containing both nope and rope parts. Shape:
\[tot\_seq\_len, num\_heads, qk\_nope\_head\_dim + qk\_rope\_head\_dim].
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Indicates where each request starts and ends in
input. This is a 1D tensor of shape \[num\_batches + 1].
* freqs\_cis ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Precomputed RoPE frequency values for rotary position
embeddings. Shape: [max\_seq\_len, qk\_rope\_head\_dim].
* kv\_a\_proj\_layernorm ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – RMSNorm gamma weights for normalizing the KV cache.
Shape: [kv\_lora\_rank].
* w\_uk ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Weight matrix for projecting q\_nope to kv\_latent\_dim. Shape:
\[num\_heads, kv\_latent\_dim, qk\_nope\_head\_dim].
* w\_uk\_scale ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scale for the weight matrix. Shape varies depending on
the float8\_config.
* w\_uv ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Weight matrix for projecting MLA decode output to v\_head\_dim.
Shape: [num\_heads, v\_head\_dim, kv\_latent\_dim].
* w\_uv\_scale ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scale for the weight matrix. Shape varies depending on
the float8\_config.
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KVCacheParams
* kv\_collection (PagedCacheValues) – Paged KV Cache object.
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Layer index.
* mask\_variant ([MHAMaskVariant](attention/mask_config.md#max.nn.legacy.attention.mask_config.MHAMaskVariant)) – Mask variant.
* scale ([float](https://docs.python.org/3/library/functions.html#float)) – Scale for the attention calculation.
* epsilon ([float](https://docs.python.org/3/library/functions.html#float)) – Small constant for numerical stability in RMSNorm.
* v\_head\_dim ([int](https://docs.python.org/3/library/functions.html#int)) – Dimension of the V heads.
* float8\_config ([Float8Config](float8_config.md#max.nn.legacy.float8_config.Float8Config)) – Float8Config for the weight matrix.
## `mla_prefill_branch_fp8()` {#max.nn.legacy.kernels.mla_prefill_branch_fp8}
> max.nn.legacy.kernels.mla\_prefill\_branch\_fp8(q, input\_row\_offsets, freqs\_cis, kv\_a\_proj\_layernorm, buffer\_row\_offsets, cache\_offsets, buffer\_length, kv\_b\_proj, kv\_b\_proj\_scale, kv\_params, kv\_collection, layer\_idx, mask\_variant, scale, epsilon, v\_head\_dim, float8\_config)
This is a manually fused kernel that performs the following operations:
* Apply RoPE to the query and the key cache (in-place).
* Apply RMSNorm to the non-rope portion of the key cache (in-place).
* Copy the KV latent values from PagedKVCache to a contiguous buffer.
* Quantize the KV latent values to fp8.
* Up-project the latent KV values to full K and V through a matmul.
* Split the concatenated KV into K and V.
* Perform MLA prefill.
**Parameters:**
* q ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Combined query tensor containing both nope and rope parts. Shape:
\[tot\_seq\_len, num\_heads, qk\_nope\_head\_dim + qk\_rope\_head\_dim].
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Indicates where each request starts and ends in
input. This is a 1D tensor of shape \[num\_batches + 1].
* freqs\_cis ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Precomputed RoPE frequency values for rotary position
embeddings. Shape: [max\_seq\_len, qk\_rope\_head\_dim].
* kv\_a\_proj\_layernorm ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – RMSNorm gamma weights for normalizing the KV cache.
Shape: [kv\_lora\_rank].
* buffer\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Indicates where each request’s KV latent values
should be stored in the contiguous buffer. This is a 1D tensor of
shape \[num\_batches + 1].
* cache\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Indicates the starting token position in the KV cache
from which to copy KV latent values for each request. This is a 1D
tensor of shape \[num\_batches + 1].
* buffer\_length ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The total number of tokens in the KV cache. Scalar.
* kv\_b\_proj ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Weight matrix for up-projecting the KV latent values to full
K and V. Shape: [num\_heads \* (qk\_nope\_head\_dim + v\_head\_dim),
kv\_latent\_dim].
* kv\_b\_proj\_scale ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scale for the weight matrix. Shape varies
depending on the float8\_config.
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – KVCacheParams
* kv\_collection (PagedCacheValues) – Paged KV Cache object.
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Layer index.
* mask\_variant ([MHAMaskVariant](attention/mask_config.md#max.nn.legacy.attention.mask_config.MHAMaskVariant)) – Mask variant.
* scale ([float](https://docs.python.org/3/library/functions.html#float)) – Scale for the attention calculation.
* epsilon ([float](https://docs.python.org/3/library/functions.html#float)) – Small constant for numerical stability in RMSNorm.
* v\_head\_dim ([int](https://docs.python.org/3/library/functions.html#int)) – Dimension of the V heads.
* float8\_config ([Float8Config](float8_config.md#max.nn.legacy.float8_config.Float8Config)) – Float8Config for the weight matrix.
## `mla_prefill_decode_graph_bf16()` {#max.nn.legacy.kernels.mla_prefill_decode_graph_bf16}
> max.nn.legacy.kernels.mla\_prefill\_decode\_graph\_bf16(q, input\_row\_offsets, freqs\_cis, kv\_norm\_gamma, buffer\_row\_offsets, cache\_offsets, buffer\_length, kv\_b\_proj, w\_uk, w\_uv, kv\_params, kv\_collection, layer\_idx, mask\_variant, scale, epsilon, v\_head\_dim)
BF16 mega-kernel for MLA prefill/decode.
Switches between prefill and decode based on the maximum sequence length in
the batch.
## `mla_prefill_decode_graph_fp8()` {#max.nn.legacy.kernels.mla_prefill_decode_graph_fp8}
> max.nn.legacy.kernels.mla\_prefill\_decode\_graph\_fp8(q, input\_row\_offsets, freqs\_cis, kv\_a\_proj\_layernorm, buffer\_row\_offsets, cache\_offsets, buffer\_length, kv\_b\_proj, kv\_b\_proj\_scale, w\_uk, w\_uk\_scale, w\_uv, w\_uv\_scale, kv\_params, kv\_collection, layer\_idx, mask\_variant, scale, epsilon, v\_head\_dim, float8\_config)
Fused MLA prefill/decode kernel for FP8.
Switches between prefill and decode based on the maximum sequence length in
the batch. See mla\_prefill\_branch\_fp8 and mla\_decode\_branch\_fp8 for the
dedicated paths.
**Returns:**
Output tensor of shape \[total\_seq\_len, num\_heads, v\_head\_dim].
**Raises:**
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If input ranks/dtypes or cache strategy are invalid.
* [AssertionError](https://docs.python.org/3/library/exceptions.html#AssertionError) – If float8 scale block sizes are not set.
## `moe_create_indices()` {#max.nn.legacy.kernels.moe_create_indices}
> max.nn.legacy.kernels.moe\_create\_indices(topk\_ids, num\_local\_experts)
Creates indices for the MoE layer.
**Parameters:**
* topk\_ids ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The expert assignments for each token from the router.
* num\_local\_experts ([int](https://docs.python.org/3/library/functions.html#int)) – The number of experts on this device.
**Returns:**
* token\_expert\_order: The reordered token indices, grouped by assigned expert.
* expert\_start\_indices: The starting index for each expert’s token group in
the reordered sequence.
* restore\_token\_order: The indices to restore original token ordering after
expert computation.
* expert\_ids: The IDs of the active experts selected for the tokens.
* expert\_usage\_stats: The maximum number of tokens assigned to any expert,
and the number of active experts.
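For intuition, the grouping performed here can be sketched with NumPy. This is a simplified reference for the first three returned tensors only (it assumes a flattened 1D `topk_ids` and omits the expert-id and usage-stats outputs); the MAX kernel operates on `TensorValue`s:

```python
import numpy as np

def moe_create_indices_ref(topk_ids: np.ndarray, num_local_experts: int):
    """Sketch of moe_create_indices semantics (hypothetical helper)."""
    # Group token indices by assigned expert; a stable sort keeps the
    # original token order within each expert's group.
    token_expert_order = np.argsort(topk_ids, kind="stable")
    sorted_ids = topk_ids[token_expert_order]
    # Starting index of each expert's contiguous group in the reordering.
    counts = np.bincount(sorted_ids, minlength=num_local_experts)
    expert_start_indices = np.concatenate(([0], np.cumsum(counts)))
    # Inverse permutation restores the original token order afterwards.
    restore_token_order = np.argsort(token_expert_order, kind="stable")
    return token_expert_order, expert_start_indices, restore_token_order

order, starts, restore = moe_create_indices_ref(np.array([1, 0, 1, 0]), 2)
# order groups tokens of expert 0 first, then expert 1;
# applying restore to the reordered sequence recovers the input order.
```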
**Parameters:**
* expert\_scores ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The scores for each expert for each token. Shape:
\[num\_tokens, n\_routed\_experts].
* expert\_bias ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The bias for each expert. Shape: [n\_routed\_experts].
* n\_routed\_experts ([int](https://docs.python.org/3/library/functions.html#int)) – The total number of experts. Must be divisible by
n\_groups.
* n\_experts\_per\_tok ([int](https://docs.python.org/3/library/functions.html#int)) – The number of experts to be selected per token.
* n\_groups ([int](https://docs.python.org/3/library/functions.html#int)) – The total number of expert groups. Must evenly divide
n\_routed\_experts.
* topk\_group ([int](https://docs.python.org/3/library/functions.html#int)) – The maximum number of expert groups that a token will be
routed to.
* norm\_weights ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to normalize the selected expert weights.
* routed\_scaling\_factor ([float](https://docs.python.org/3/library/functions.html#float)) – The scaling factor for the routed expert weights.
**Returns:**
* expert\_indices: The indices of the routed experts for each token.
Shape: [num\_tokens, n\_experts\_per\_tok].
* expert\_weights: The weights of the routed experts for each token.
Shape: [num\_tokens, n\_experts\_per\_tok].
**Return type:**
A tuple of two tensors
## `needs_fp8_fnuz_conversion()` {#max.nn.legacy.kernels.needs_fp8_fnuz_conversion}
> max.nn.legacy.kernels.needs\_fp8\_fnuz\_conversion()
Check if we need to convert FP8 E4M3FN to FNUZ for AMD GPUs.
**Returns:**
True if running on AMD GPU with CDNA3 architecture, False otherwise.
## `normalize_e4m3fn_to_e4m3fnuz()` {#max.nn.legacy.kernels.normalize_e4m3fn_to_e4m3fnuz}
> max.nn.legacy.kernels.normalize\_e4m3fn\_to\_e4m3fnuz(weight, weight\_scale)
Convert E4M3FN weights to E4M3FNUZ format for AMD GPUs.
This conversion is necessary because AMD GPUs use the E4M3FNUZ format
while NVIDIA GPUs use E4M3FN. The key differences are:
1. The bit pattern 10000000 (0x80) represents negative zero in E4M3FN but NaN in E4M3FNUZ
2. For the same bit representation, E4M3FNUZ values are half of E4M3FN values
**Parameters:**
* weight ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The weight tensor in E4M3FN format.
* weight\_scale ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The weight scale factor.
**Returns:**
Tuple of (converted\_weight, adjusted\_weight\_scale, adjusted\_input\_scale).
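The two format differences above imply a bit-level rewrite plus a scale adjustment. The sketch below illustrates the assumed semantics on raw uint8 storage (a hedged reference, not the MAX implementation, which operates on `TensorValue`s):

```python
import numpy as np

def normalize_e4m3fn_to_e4m3fnuz_ref(weight_bits: np.ndarray, weight_scale: float):
    """Bit-level sketch of the E4M3FN -> E4M3FNUZ conversion (assumed semantics)."""
    out = weight_bits.copy()
    # 0x80 encodes negative zero in E4M3FN but NaN in E4M3FNUZ, so remap it to +0.0.
    out[out == 0x80] = 0x00
    # For identical bit patterns, E4M3FNUZ values are half the E4M3FN values,
    # so the scale is doubled to preserve the dequantized result.
    return out, weight_scale * 2.0

bits = np.array([0x80, 0x3F], dtype=np.uint8)
converted, scale = normalize_e4m3fn_to_e4m3fnuz_ref(bits, 0.5)
```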
## `quantize_dynamic_block_scaled_fp4()` {#max.nn.legacy.kernels.quantize_dynamic_block_scaled_fp4}
> max.nn.legacy.kernels.quantize\_dynamic\_block\_scaled\_fp4(input, tensor\_sf, sf\_vector\_size=16, scales\_type=float8\_e4m3fn, out\_type=uint8)
Dynamically quantize the input tensor to fp4-e2m1fn.
**Parameters:**
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The input tensor to quantize. Shape: [seq\_len, hidden\_size]
* tensor\_sf ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [float](https://docs.python.org/3/library/functions.html#float)) – The tensor-wise scale factor (inverted as per quantization kernel requirement).
* sf\_vector\_size ([int](https://docs.python.org/3/library/functions.html#int)) – The block size for the scaling factors.
* out\_type ([DType](../../dtype.md#max.dtype.DType)) – The type of the output tensor.
* scales\_type ([DType](../../dtype.md#max.dtype.DType)) – The type of the scales tensor.
**Returns:**
The quantized tensor in \[seq\_len, hidden\_size // 2] layout and the scales in \[ceildiv(seq\_len, 128), ceildiv(hidden\_size, sf\_vector\_size \* 4), 32, 4, 4] layout.
## `quantize_dynamic_scaled_float8()` {#max.nn.legacy.kernels.quantize_dynamic_scaled_float8}
> max.nn.legacy.kernels.quantize\_dynamic\_scaled\_float8(input, input\_scale\_spec, weight\_scale\_spec, scale\_ub=1200.0, group\_size\_or\_per\_token=-1, out\_type=float8\_e4m3fn, scales\_type=bfloat16)
Dynamically quantize the input tensor to fp8.
**Parameters:**
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The input tensor to quantize.
* scale\_ub ([float](https://docs.python.org/3/library/functions.html#float)) – The upper bound of the scale factor.
* group\_size\_or\_per\_token ([int](https://docs.python.org/3/library/functions.html#int)) – The group size for quantization. When set to -1,
the quantization is column-wise.
* out\_type ([DType](../../dtype.md#max.dtype.DType)) – The type of the output tensor.
* scales\_type ([DType](../../dtype.md#max.dtype.DType)) – The type of the scales tensor.
* input\_scale\_spec ([Float8InputScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8InputScaleSpec))
* weight\_scale\_spec ([Float8WeightScaleSpec](float8_config.md#max.nn.legacy.float8_config.Float8WeightScaleSpec))
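The per-token path of dynamic fp8 quantization can be sketched as follows. This is a hedged reference under assumed semantics (row-wise scales derived from each row's max magnitude, clamped to `scale_ub`); the actual kernel's scale layout, clamping order, and final cast to `float8_e4m3fn` may differ:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_per_token_ref(x: np.ndarray, scale_ub: float = 1200.0):
    """Sketch of per-token dynamic fp8 quantization (hypothetical helper)."""
    amax = np.abs(x).max(axis=-1, keepdims=True)
    scale = np.minimum(amax / FP8_E4M3_MAX, scale_ub)  # one scale per row
    scale = np.maximum(scale, 1e-12)                   # guard all-zero rows
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale  # the real kernel would cast q to float8_e4m3fn here

x = np.array([[448.0, -224.0], [1.0, 0.5]])
q, s = quantize_per_token_ref(x)
```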
## `rms_norm_key_cache()` {#max.nn.legacy.kernels.rms_norm_key_cache}
> max.nn.legacy.kernels.rms\_norm\_key\_cache(kv\_params, kv\_collection, gamma, epsilon, layer\_idx, total\_seq\_len, input\_row\_offsets, weight\_offset, rms\_norm\_cols=None, multiply\_before\_cast=True, per\_head\_norm=True)
This function applies RMSNorm to the \_new\_ entries in the KVCache.
When per\_head\_norm=True (default), RMSNorm is applied separately to each head.
In this mode, gamma should have size \[head\_dim] and normalization occurs
across the head\_dim dimensions within each head.
When per\_head\_norm=False, RMSNorm is applied per token across all heads.
In this mode, gamma should have size \[n\_kv\_heads \* head\_dim] and normalization
occurs across all dimensions for each token.
The size of the gamma tensor determines how many dimensions will be normalized.
If gamma’s size doesn’t match the expected size based on per\_head\_norm setting,
rms\_norm\_cols must be explicitly specified to confirm the intention to normalize
only a subset of dimensions.
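The two gamma-sizing modes can be sketched with a plain RMSNorm over the last dimension. This is a simplified reference for the normalization math only (it ignores the cache layout, `weight_offset`, and `multiply_before_cast`):

```python
import numpy as np

def rms_norm_ref(x: np.ndarray, gamma: np.ndarray, epsilon: float = 1e-6):
    """RMSNorm over the last dimension (reference sketch)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + epsilon)
    return (x / rms) * gamma

k = np.random.default_rng(0).standard_normal((4, 2, 8))  # [tokens, kv_heads, head_dim]
# per_head_norm=True: gamma has size [head_dim]; each head normalized separately.
per_head = rms_norm_ref(k, np.ones(8))
# per_head_norm=False: gamma has size [n_kv_heads * head_dim]; each token
# is normalized as one flat vector across all of its heads.
per_token = rms_norm_ref(k.reshape(4, 16), np.ones(16)).reshape(4, 2, 8)
```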
Currently, the KVCacheT class itself isn’t aware of new cache entries until the
cache length is incremented, which happens after the model’s forward pass, so
input\_row\_offsets is used to locate the new entries.
## `scatter_nd_skip_oob_indices()` {#max.nn.legacy.kernels.scatter_nd_skip_oob_indices}
> max.nn.legacy.kernels.scatter\_nd\_skip\_oob\_indices(input, updates, indices)
Creates a new symbolic tensor where the updates are scattered into input at specified indices.
This differs from scatter\_nd in that it handles oob indices by skipping
the update for that index. Oob indices are those which fall outside of
the range \[-dim, dim).
**Parameters:**
* input (Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/shape.md#max.graph.shape.Shape) | [Dim](../../graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../../driver.md#max.driver.DLPackArray)) – The input symbolic tensor to write elements to.
* updates (Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/shape.md#max.graph.shape.Shape) | [Dim](../../graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../../driver.md#max.driver.DLPackArray)) – A symbolic tensor of elements to write to input.
* indices (Value\[TensorType] | [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | [Shape](../../graph/shape.md#max.graph.shape.Shape) | [Dim](../../graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](../../driver.md#max.driver.DLPackArray)) – A tensor of indices specifying where to write updates.
Shape should be \[num\_updates, rank] for full indexing or
\[num\_updates, k] for partial indexing where k < rank.
**Returns:**
A new symbolic tensor representing the result of the scatter\_nd operation.
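The out-of-bounds-skipping behavior can be illustrated on a 1D tensor. This is a reference sketch of the semantics (full indexing, `indices` of shape \[num\_updates, 1]), not the symbolic-graph operation:

```python
import numpy as np

def scatter_nd_skip_oob_ref(data: np.ndarray, updates: np.ndarray,
                            indices: np.ndarray) -> np.ndarray:
    """Scatter that silently skips out-of-bounds indices (reference sketch)."""
    out = data.copy()
    dim = out.shape[0]
    for upd, (idx,) in zip(updates, indices):
        if -dim <= idx < dim:   # valid range is [-dim, dim)
            out[idx] = upd      # OOB indices are skipped, not an error
    return out

out = scatter_nd_skip_oob_ref(np.zeros(4), np.array([1.0, 2.0, 3.0]),
                              np.array([[1], [7], [-1]]))
# index 7 is out of [-4, 4), so its update is dropped; -1 wraps to the last slot.
```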
## `scatter_set_constant()` {#max.nn.legacy.kernels.scatter_set_constant}
> max.nn.legacy.kernels.scatter\_set\_constant(data, indices, fill\_val)
Scatters values into a tensor at specified indices.
## `sgmv_kernel()` {#max.nn.legacy.kernels.sgmv_kernel}
> max.nn.legacy.kernels.sgmv\_kernel(input, lora, lora\_ids, lora\_ranks, input\_row\_offsets, max\_lora\_seq\_len, lora\_end\_idx=None, bias=None)
Performs the SGMV kernel for LoRA. This kernel is LoRA-agnostic: it can
compute either the LoRA A or the LoRA B projection.
**Parameters:**
* input – The input tensor.
* lora – The LoRA tensor.
* lora\_ids – IDs of the LoRAs used for each sequence.
* lora\_ranks – The ranks of the LoRAs in the batch.
* input\_row\_offsets – The sequence offsets that use LoRA.
* max\_lora\_seq\_len – The maximum sequence length of any given LoRA in the batch.
* lora\_end\_idx – End index of LoRA tokens in the batch. Optional.
* bias – The LoRA bias.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.
## `sgmv_lora_kernel()` {#max.nn.legacy.kernels.sgmv_lora_kernel}
> max.nn.legacy.kernels.sgmv\_lora\_kernel(input, lora\_a, lora\_b, lora\_ids, lora\_ranks, grouped\_row\_offsets, lora\_end\_idx, max\_lora\_seq\_len, bias=None)
Computes the SGMV LoRA kernel for some number of LoRAs A and B given the input.
out = Wx + xAB
SGMV can be explained by two independent kernels, where v is a zero-initialized
intermediate tensor and y is the output tensor:
* shrink – projects the high-dimensional input down to the low-rank space: v += xA
* expand – projects the low-rank intermediate back to the high-dimensional space: y += vB
**Parameters:**
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The input tensor
* lora\_a ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The LoRA tensor for A
* lora\_b ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The LoRA tensor for B
* lora\_ids ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Ids of the LoRAs used for each sequence
* lora\_ranks ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The ranks of the LoRAs in the batch
* grouped\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The grouped sequence offsets that use LoRA
* max\_lora\_seq\_len ([int](https://docs.python.org/3/library/functions.html#int)) – The maximum sequence length of any given LoRA in the batch
* bias ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | None) – The LoRA bias
* lora\_end\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue))
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.
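A dense NumPy sketch of the shrink/expand pair may help fix the layouts in mind. Tensor shapes here are assumptions for illustration (per-adapter dense weights, uniform rank, no ragged packing), not the kernel's actual memory layout:

```python
import numpy as np

def sgmv_lora_ref(x, lora_a, lora_b, lora_ids, row_offsets):
    """Dense sketch of SGMV shrink + expand (assumed layouts).

    x: [total_tokens, hidden]; lora_a: [G, hidden, rank];
    lora_b: [G, rank, hidden]; row_offsets: [num_groups + 1].
    """
    y = np.zeros_like(x)
    for g, lora_id in enumerate(lora_ids):
        start, end = row_offsets[g], row_offsets[g + 1]
        v = x[start:end] @ lora_a[lora_id]   # shrink: v += xA
        y[start:end] = v @ lora_b[lora_id]   # expand: y += vB
    return y  # the caller adds this to the base projection Wx

x = np.array([[3.0, 4.0]])
lora_a = np.array([[[1.0], [0.0]]])   # G=1, hidden=2, rank=1
lora_b = np.array([[[2.0, 0.0]]])     # G=1, rank=1, hidden=2
y = sgmv_lora_ref(x, lora_a, lora_b, lora_ids=[0], row_offsets=[0, 1])
```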
## `sgmv_lora_qkv_shrink()` {#max.nn.legacy.kernels.sgmv_lora_qkv_shrink}
> max.nn.legacy.kernels.sgmv\_lora\_qkv\_shrink(input, lora\_a, lora\_ids, lora\_grouped\_offsets, lora\_end\_idx, max\_lora\_seq\_len, max\_rank)
LoRA shrink grouped matmul with planar Q/K/V output.
Performs the LoRA ‘shrink’ operation for routed tokens using SGMV (segmented
grouped matrix-vector multiplication). Computes \[M, K] @ \[G, 3\*rank, K]^T
per active LoRA adapter, then permutes the flat \[M, 3\*rank] result into a
planar layout \[3, M, rank] representing separate Q, K, V projections.
**Parameters:**
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Routed activation matrix with shape (M, K), where M is the total
number of tokens and K is the hidden dimension.
* lora\_a ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Shrink weights for all LoRA adapters, shape (G, 3\*rank, K) where
G is the number of adapters and rank is the LoRA rank.
* lora\_ids ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Expert/adapter indices for each active group, shape (num\_active,).
Values in range \[0, G). May use -1 to indicate inactive slots.
* lora\_grouped\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Inclusive prefix sums of tokens per active adapter,
shape (num\_active + 1,). Defines per-adapter \[start, end) ranges in
input. Must be non-decreasing with offsets\[0] == 0.
* max\_lora\_seq\_len ([int](https://docs.python.org/3/library/functions.html#int)) – Upper bound on tokens for any active adapter. Used for
kernel tuning and memory allocation.
* max\_rank ([int](https://docs.python.org/3/library/functions.html#int)) – The maximum LoRA rank, determines output shape.
* lora\_end\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue))
**Returns:**
Output tensor with planar Q/K/V layout, shape (3, M, max\_rank).
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.
## `sgmv_qkv_lora_kernel()` {#max.nn.legacy.kernels.sgmv_qkv_lora_kernel}
> max.nn.legacy.kernels.sgmv\_qkv\_lora\_kernel(input, lora\_a, lora\_b\_q, lora\_b\_kv, lora\_ids, lora\_ranks, input\_row\_offsets, lora\_grouped\_offsets, lora\_end\_idx, batch\_seq\_len, lora\_ids\_kv, lora\_grouped\_offsets\_kv, kv\_collection, kv\_params, layer\_idx, max\_lora\_seq\_len, max\_rank, bias=None)
Computes the SGMV QKV LoRA kernel for Q, K, V projections with LoRA.
**Parameters:**
* input ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The input tensor.
* lora\_a ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The LoRA A tensor.
* lora\_b\_q ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The LoRA B tensor for Q projection.
* lora\_b\_kv ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The LoRA B tensor for K and V projections (stacked).
* lora\_ids ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – IDs of the LoRAs used for each sequence.
* lora\_ranks ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The ranks of the LoRAs in the batch.
* input\_row\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The sequence offsets that use LoRA.
* lora\_grouped\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Grouped offsets for LoRA sequences.
* lora\_end\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – End index of LoRA tokens in the batch.
* batch\_seq\_len ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Total sequence length of the batch.
* lora\_ids\_kv ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – LoRA IDs for KV projections (with offset for V portion).
* lora\_grouped\_offsets\_kv ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Grouped offsets for KV LoRA sequences.
* kv\_collection (PagedCacheValues) – The KV cache.
* kv\_params ([KVCacheParams](kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParams)) – The KV params.
* layer\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The layer index to retrieve the KV cache.
* max\_lora\_seq\_len ([int](https://docs.python.org/3/library/functions.html#int)) – The maximum sequence length of any given LoRA in the batch.
* max\_rank ([int](https://docs.python.org/3/library/functions.html#int)) – The maximum rank for the LoRAs.
* bias ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue) | None) – Optional LoRA bias.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.
## `sleep()` {#max.nn.legacy.kernels.sleep}
> max.nn.legacy.kernels.sleep(duration\_sec, device\_ref)
Sleep for the given duration in seconds.
This kernel is supported on CPUs and GPUs. However, the timing may be
inaccurate on AMD GPUs due to limitations of the current time.sleep(…) implementation.
**Parameters:**
* duration\_sec ([BufferValue](../../graph/BufferValue.md#max.graph.BufferValue)) – The duration to sleep in seconds.
* device\_ref ([DeviceRef](../../graph/type.md#max.graph.type.DeviceRef))
**Return type:**
None
## `sliced_add()` {#max.nn.legacy.kernels.sliced_add}
> max.nn.legacy.kernels.sliced\_add(x, y, lora\_end\_idx)
Adds tensors x and y element-wise for rows < lora\_end\_idx, otherwise copies x.
This is used for LoRA where only some sequences have LoRA applied.
For rows in \[0, lora\_end\_idx): c = x + y
For rows in \[lora\_end\_idx, batch\_seq\_len): c = x
**Parameters:**
* x ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – First input tensor.
* y ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – Second input tensor.
* lora\_end\_idx ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – End index of LoRA token portion (rows to apply add).
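The two row ranges described above can be sketched directly:

```python
import numpy as np

def sliced_add_ref(x: np.ndarray, y: np.ndarray, lora_end_idx: int) -> np.ndarray:
    """Reference semantics for sliced_add."""
    out = x.copy()
    out[:lora_end_idx] += y[:lora_end_idx]  # rows with LoRA applied: c = x + y
    return out                              # remaining rows pass through: c = x

out = sliced_add_ref(np.ones((3, 2)), np.full((3, 2), 2.0), lora_end_idx=2)
```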
## `spatial_merge()` {#max.nn.legacy.kernels.spatial_merge}
> max.nn.legacy.kernels.spatial\_merge(input, grid\_thw, hidden\_size, merge\_size)
Performs spatial merge operation on ragged input tensors.
This operation merges spatial dimensions of input patches according to
the grid dimensions specified in grid\_thw.
## `unfused_qkv_ragged_matmul_gguf_quantized()` {#max.nn.legacy.kernels.unfused_qkv_ragged_matmul_gguf_quantized}
> max.nn.legacy.kernels.unfused\_qkv\_ragged\_matmul\_gguf\_quantized(kv\_params, input, input\_row\_offsets, n\_heads, q\_weight, k\_weight, v\_weight, quantization\_encoding\_q, quantization\_encoding\_k, quantization\_encoding\_v, kv\_collection, layer\_idx)
Computes fused query, key, and value projections with ragged input and
quantized weight matrices. A quantization\_config must be provided.
input and input\_row\_offsets are used together to implement the ragged
tensor: input\_row\_offsets indicates where each sequence starts and ends in
input.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – on input shapes/dtypes that are invalid for the kernel.
## `update_frequency_data()` {#max.nn.legacy.kernels.update_frequency_data}
> max.nn.legacy.kernels.update\_frequency\_data(frequency\_data, frequency\_offsets, tokens)
Updates the frequency data.
**Parameters:**
* frequency\_data ([BufferValue](../../graph/BufferValue.md#max.graph.BufferValue)) – 2d tensor of shape \[unique\_tokens, 2], where
the first column indicates the token id and the second column
indicates the frequency of the token.
* frequency\_offsets ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – 1d tensor of shape \[batch\_size + 1], indicating
start of each sequence’s data.
* tokens ([TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)) – The tokens to update the frequency data with.
**Return type:**
None
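A sketch of the in-place update under assumed semantics (one new token per sequence, matched against the token-id column; how previously unseen token IDs enter the table is not shown):

```python
import numpy as np

def update_frequency_data_ref(frequency_data, frequency_offsets, tokens):
    """Bump per-sequence token counts in a [unique_tokens, 2] table, in place."""
    for seq in range(len(frequency_offsets) - 1):
        start, end = frequency_offsets[seq], frequency_offsets[seq + 1]
        rows = frequency_data[start:end]              # this sequence's rows
        hit = np.nonzero(rows[:, 0] == tokens[seq])[0]
        if hit.size:                                  # token already tracked
            rows[hit[0], 1] += 1

freq = np.array([[5, 0], [7, 2]])   # (token_id, frequency) rows
update_frequency_data_ref(freq, np.array([0, 2]), np.array([7]))
```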
---
## cache_params
## `KVCacheParamInterface` {#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface}
> class max.nn.legacy.kv\_cache.cache\_params.KVCacheParamInterface(\*args, \*\*kwargs)
Interface for KV cache parameters.
### `bytes_per_block` {#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface.bytes_per_block}
> property bytes\_per\_block: [int](https://docs.python.org/3/library/functions.html#int)
Number of bytes per cache block.
### `cache_strategy` {#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface.cache_strategy}
> cache\_strategy: [KVCacheStrategy](#max.nn.legacy.kv_cache.cache_params.KVCacheStrategy)
### `data_parallel_degree` {#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface.data_parallel_degree}
> data\_parallel\_degree: [int](https://docs.python.org/3/library/functions.html#int)
### `get_symbolic_inputs()` {#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface.get_symbolic_inputs}
> get\_symbolic\_inputs()
Returns the symbolic inputs for the KV cache.
**Return type:**
InputSymbolInterface
### `n_devices` {#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface.n_devices}
> n\_devices: [int](https://docs.python.org/3/library/functions.html#int)
### `page_size` {#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface.page_size}
> page\_size: [int](https://docs.python.org/3/library/functions.html#int)
## `KVCacheParams` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams}
> class max.nn.legacy.kv\_cache.cache\_params.KVCacheParams(dtype, n\_kv\_heads, head\_dim, num\_layers, devices, enable\_prefix\_caching=False, enable\_kvcache\_swapping\_to\_host=False, host\_kvcache\_swap\_space\_gb=None, cache\_strategy=KVCacheStrategy.PAGED, page\_size=128, is\_mla=False, data\_parallel\_degree=1, n\_kv\_heads\_per\_device=0, kvcache\_quant\_config=None)
Configuration parameters for key-value cache management in transformer models.
This class encapsulates all configuration options for managing KV caches during
inference, including parallelism settings, memory management, and cache strategy.
### `bytes_per_block` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.bytes_per_block}
> property bytes\_per\_block: [int](https://docs.python.org/3/library/functions.html#int)
Returns the number of bytes per cache block.
When TP>1, each block is sharded across the devices in the tensor parallel group.
This property returns the total memory needed to store a block across these
devices, including memory for scales if quantization is enabled.
**Returns:**
The number of bytes per cache block.
### `cache_strategy` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.cache_strategy}
> cache\_strategy: [KVCacheStrategy](#max.nn.legacy.kv_cache.cache_params.KVCacheStrategy) = 'paged'
Strategy to use for managing the KV cache.
### `compute_num_host_blocks()` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.compute_num_host_blocks}
> compute\_num\_host\_blocks()
Computes the number of blocks that can be allocated to the host.
**Returns:**
The number of blocks that can be allocated to the host.
### `copy_as_dp_1()` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.copy_as_dp_1}
> copy\_as\_dp\_1()
Creates a copy of the KVCacheParams with data parallelism disabled.
This method creates a new instance of the current configuration and adjusts
the device count to reflect a tensor-parallel-only setup (data\_parallel\_degree=1).
The number of devices is divided by the current data parallel degree.
**Returns:**
A new KVCacheParams instance with data\_parallel\_degree set to 1.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If n\_devices is not evenly divisible by data\_parallel\_degree.
### `data_parallel_degree` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.data_parallel_degree}
> data\_parallel\_degree: [int](https://docs.python.org/3/library/functions.html#int) = 1
Degree of data parallelism. Must be 1 or equal to n\_devices (DP+TP not yet supported).
### `devices` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.devices}
> devices: [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[DeviceRef](../../../graph/type.md#max.graph.type.DeviceRef)]
Devices to use for the KV cache.
### `dtype` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.dtype}
> dtype: [DType](../../../dtype.md#max.dtype.DType)
Data type for storing key and value tensors in the cache.
### `dtype_shorthand` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.dtype_shorthand}
> property dtype\_shorthand: [str](https://docs.python.org/3/library/stdtypes.html#str)
Returns a shorthand textual representation of the data type.
**Returns:**
“bf16” for bfloat16 dtype, “f32” otherwise.
### `enable_kvcache_swapping_to_host` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.enable_kvcache_swapping_to_host}
> enable\_kvcache\_swapping\_to\_host: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether to enable swapping of KV cache blocks to host memory when device memory is full.
### `enable_prefix_caching` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.enable_prefix_caching}
> enable\_prefix\_caching: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether to enable prefix caching for efficient reuse of common prompt prefixes.
### `get_symbolic_inputs()` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.get_symbolic_inputs}
> get\_symbolic\_inputs()
Computes the symbolic inputs for the KV cache.
This method returns a list of PagedCacheInputSymbols for each replica.
This is used when constructing the model graph.
**Returns:**
The symbolic inputs for the KV cache.
**Return type:**
PagedCacheInputSymbolsByReplica
### `head_dim` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.head_dim}
> head\_dim: [int](https://docs.python.org/3/library/functions.html#int)
Dimensionality of each attention head.
### `host_kvcache_swap_space_gb` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.host_kvcache_swap_space_gb}
> host\_kvcache\_swap\_space\_gb: [float](https://docs.python.org/3/library/functions.html#float) | [None](https://docs.python.org/3/library/constants.html#None) = None
Amount of host memory (in GB) to reserve for KV cache swapping. Required when swapping is enabled.
### `is_mla` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.is_mla}
> is\_mla: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether the model uses Multi-Latent Attention (MLA) architecture.
### `kvcache_quant_config` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.kvcache_quant_config}
> kvcache\_quant\_config: [KVCacheQuantizationConfig](#max.nn.legacy.kv_cache.cache_params.KVCacheQuantizationConfig) | [None](https://docs.python.org/3/library/constants.html#None) = None
KVCache quantization config. Currently only FP8 quantization supported.
### `n_devices` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.n_devices}
> property n\_devices: [int](https://docs.python.org/3/library/functions.html#int)
Returns the number of devices.
**Returns:**
The number of devices.
### `n_kv_heads` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.n_kv_heads}
> n\_kv\_heads: [int](https://docs.python.org/3/library/functions.html#int)
Total number of key-value attention heads across all devices.
### `n_kv_heads_per_device` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.n_kv_heads_per_device}
> n\_kv\_heads\_per\_device: [int](https://docs.python.org/3/library/functions.html#int) = 0
Number of KV heads allocated to each device. Computed automatically in \_\_post\_init\_\_.
### `num_layers` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.num_layers}
> num\_layers: [int](https://docs.python.org/3/library/functions.html#int)
Number of layers in the model.
### `page_size` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.page_size}
> page\_size: [int](https://docs.python.org/3/library/functions.html#int) = 128
Number of tokens per page (block) when using the paged cache strategy.
This value is expressed in tokens, not bytes. The byte footprint of a page is
derived from pipeline configuration.
Current constraints: the page size must be a multiple of 128 and at least 128.
Required when `cache_strategy` is `KVCacheStrategy.PAGED`.
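For a back-of-envelope sense of how page size translates to bytes, the sketch below assumes a plain layout of K and V planes per layer (it is not the `bytes_per_block` implementation, which also accounts for sharding details and quantization scales):

```python
def bytes_per_block_estimate(num_layers: int, n_kv_heads_per_device: int,
                             head_dim: int, page_size: int = 128,
                             dtype_bytes: int = 2, n_devices: int = 1) -> int:
    """Rough KV block size: K and V planes for every layer (assumed layout)."""
    per_device = (2 * num_layers * n_kv_heads_per_device * head_dim
                  * page_size * dtype_bytes)
    return per_device * n_devices  # a block spans the tensor-parallel group

# A Llama-3-8B-like config: 32 layers, 8 KV heads/device, head_dim 128, bf16.
size = bytes_per_block_estimate(32, 8, 128, page_size=128, dtype_bytes=2)
```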
### `quantized_kv_cache` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.quantized_kv_cache}
> property quantized\_kv\_cache: [bool](https://docs.python.org/3/library/functions.html#bool)
### `shape_per_block` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.shape_per_block}
> property shape\_per\_block: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
Returns the shape of each cache block.
**Returns:**
The shape of the cache block.
### `shape_per_scale_block` {#max.nn.legacy.kv_cache.cache_params.KVCacheParams.shape_per_scale_block}
> property shape\_per\_scale\_block: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
Returns the shape of each scale block used for KVCache quantization.
**Returns:**
The shape of the KVCache quantization scales block.
## `KVCacheQuantizationConfig` {#max.nn.legacy.kv_cache.cache_params.KVCacheQuantizationConfig}
> class max.nn.legacy.kv\_cache.cache\_params.KVCacheQuantizationConfig(scale\_dtype=float32, quantization\_granularity=128)
Configuration for KVCache quantization.
Currently only FP8 Quantization is supported.
### `quantization_granularity` {#max.nn.legacy.kv_cache.cache_params.KVCacheQuantizationConfig.quantization_granularity}
> quantization\_granularity: [int](https://docs.python.org/3/library/functions.html#int) = 128
Block size used for KVCache quantization along the head dimension (e.g. 128).
### `scale_dtype` {#max.nn.legacy.kv_cache.cache_params.KVCacheQuantizationConfig.scale_dtype}
> scale\_dtype: [DType](../../../dtype.md#max.dtype.DType) = 81
Data type of the quantization scales, if quantization is enabled.
## `KVCacheStrategy` {#max.nn.legacy.kv_cache.cache_params.KVCacheStrategy}
> class max.nn.legacy.kv\_cache.cache\_params.KVCacheStrategy(value, names=\<not given>, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
Enumeration of supported KV cache strategies for attention mechanisms.
This enum defines the different strategies for managing key-value caches
in transformer models during inference.
### `MODEL_DEFAULT` {#max.nn.legacy.kv_cache.cache_params.KVCacheStrategy.MODEL_DEFAULT}
> MODEL\_DEFAULT = 'model\_default'
Use the model’s default caching strategy.
### `PAGED` {#max.nn.legacy.kv_cache.cache_params.KVCacheStrategy.PAGED}
> PAGED = 'paged'
Use paged attention for efficient memory management.
### `kernel_substring()` {#max.nn.legacy.kv_cache.cache_params.KVCacheStrategy.kernel_substring}
> kernel\_substring()
Returns the common substring included in the kernel name for this caching strategy.
**Returns:**
The string representation of the cache strategy value.
## `MultiKVCacheParams` {#max.nn.legacy.kv_cache.cache_params.MultiKVCacheParams}
> class max.nn.legacy.kv\_cache.cache\_params.MultiKVCacheParams(params, cache\_strategy, page\_size, data\_parallel\_degree, n\_devices)
Aggregates multiple KV cache parameter sets.
This class implements KVCacheParamInterface by aggregating multiple
KVCacheParamInterface instances. Useful for models with multiple distinct
KV caches (e.g., different cache configurations for different layers).
### `bytes_per_block` {#max.nn.legacy.kv_cache.cache_params.MultiKVCacheParams.bytes_per_block}
> property bytes\_per\_block: [int](https://docs.python.org/3/library/functions.html#int)
Total bytes per block across all KV caches.
Since all caches allocate memory for the same sequence, the total
memory cost per block is the sum across all param sets.
### `cache_strategy` {#max.nn.legacy.kv_cache.cache_params.MultiKVCacheParams.cache_strategy}
> cache\_strategy: [KVCacheStrategy](#max.nn.legacy.kv_cache.cache_params.KVCacheStrategy)
### `data_parallel_degree` {#max.nn.legacy.kv_cache.cache_params.MultiKVCacheParams.data_parallel_degree}
> data\_parallel\_degree: [int](https://docs.python.org/3/library/functions.html#int)
### `from_params()` {#max.nn.legacy.kv_cache.cache_params.MultiKVCacheParams.from_params}
> classmethod from\_params(\*params)
### `get_symbolic_inputs()` {#max.nn.legacy.kv_cache.cache_params.MultiKVCacheParams.get_symbolic_inputs}
> get\_symbolic\_inputs()
Returns the symbolic inputs for the KV cache.
**Return type:**
MultiKVCacheInputSymbols
### `n_devices` {#max.nn.legacy.kv_cache.cache_params.MultiKVCacheParams.n_devices}
> n\_devices: [int](https://docs.python.org/3/library/functions.html#int)
### `page_size` {#max.nn.legacy.kv_cache.cache_params.MultiKVCacheParams.page_size}
> page\_size: [int](https://docs.python.org/3/library/functions.html#int)
### `params` {#max.nn.legacy.kv_cache.cache_params.MultiKVCacheParams.params}
> params: [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[KVCacheParamInterface](#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface)]
List of KV cache parameter sets to aggregate.
## `compute_max_seq_len_fitting_in_cache()` {#max.nn.legacy.kv_cache.cache_params.compute_max_seq_len_fitting_in_cache}
> max.nn.legacy.kv\_cache.cache\_params.compute\_max\_seq\_len\_fitting\_in\_cache(params, available\_cache\_memory)
Computes the maximum sequence length that can fit in the available memory.
**Parameters:**
* available\_cache\_memory ([int](https://docs.python.org/3/library/functions.html#int)) – The amount of cache memory available across all devices.
* params ([KVCacheParamInterface](#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface))
**Returns:**
The maximum sequence length that can fit in the available cache memory.
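The relationship between the memory budget, block size, and token capacity can be sketched as follows. This is a simplification: the actual byte footprint of a block is derived from pipeline configuration, and the function and parameter names here are illustrative:

```python
def max_seq_len_fitting_in_cache(
    available_cache_memory: int,  # bytes, across all devices
    bytes_per_block: int,         # byte footprint of one page/block
    page_size: int,               # tokens per block
) -> int:
    # Only whole blocks fit in memory; each block holds page_size tokens.
    num_blocks = available_cache_memory // bytes_per_block
    return num_blocks * page_size

# e.g. 1 GiB of cache with a 64 KiB footprint per 128-token block:
print(max_seq_len_fitting_in_cache(1 << 30, 64 << 10, 128))  # 2097152
```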
## `compute_num_device_blocks()` {#max.nn.legacy.kv_cache.cache_params.compute_num_device_blocks}
> max.nn.legacy.kv\_cache.cache\_params.compute\_num\_device\_blocks(params, available\_cache\_memory, max\_batch\_size, max\_seq\_len)
Computes the number of blocks that can be allocated based on the available cache memory.
The number of blocks returned is for a single replica. Each replica will
have the same number of blocks.
**Parameters:**
* available\_cache\_memory ([int](https://docs.python.org/3/library/functions.html#int)) – The amount of cache memory available across all devices.
* max\_batch\_size ([int](https://docs.python.org/3/library/functions.html#int) | None) – The maximum batch size, or None.
* max\_seq\_len ([int](https://docs.python.org/3/library/functions.html#int) | None) – The maximum sequence length, or None.
* params ([KVCacheParamInterface](#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface))
**Returns:**
The number of blocks that can be allocated for a single replica.
## `estimated_memory_size()` {#max.nn.legacy.kv_cache.cache_params.estimated_memory_size}
> max.nn.legacy.kv\_cache.cache\_params.estimated\_memory\_size(params, available\_cache\_memory, max\_batch\_size, max\_seq\_len)
Computes the estimated memory size of the KV cache used by all replicas.
**Parameters:**
* available\_cache\_memory ([int](https://docs.python.org/3/library/functions.html#int)) – The amount of cache memory available across all devices.
* max\_batch\_size ([int](https://docs.python.org/3/library/functions.html#int)) – The maximum batch size.
* max\_seq\_len ([int](https://docs.python.org/3/library/functions.html#int)) – The maximum sequence length.
* params ([KVCacheParamInterface](#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface))
**Returns:**
The estimated memory usage of the KV cache in bytes.
---
## kv_cache
Legacy key-value cache management for efficient attention computation.
## Modules
* [`cache_params`](/max/api/python/nn/legacy/kv_cache/cache_params): Configuration parameters for KV cache.
* [`manager`](/max/api/python/nn/legacy/kv_cache/manager): KV cache manager implementation.
---
## manager
---
## layer
## `Layer` {#max.nn.legacy.layer.Layer}
> class max.nn.legacy.layer.Layer
:::caution Deprecated
Deprecated since version 25.2.
:::
Base class for neural network components.
Use [`Module`](#max.nn.legacy.layer.Module) instead.
Provides functionality for adding hooks to the call function of
each layer to support testing, debugging or profiling.
## `LayerList` {#max.nn.legacy.layer.LayerList}
> class max.nn.legacy.layer.LayerList(layers)
Stores a list of layers.
Can be used as a regular Python list.
### `sublayers` {#max.nn.legacy.layer.LayerList.sublayers}
> property sublayers: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Module](#max.nn.legacy.layer.Module)]
## `Module` {#max.nn.legacy.layer.Module}
> class max.nn.legacy.layer.Module
Base class for model components with weight management.
Provides functionality to create custom layers and construct networks with automatic weight tracking.
The following example uses the [`Module`](#max.nn.legacy.layer.Module) class to create custom layers and build a neural network:
```python
from max import nn
from max.dtype import DType
from max.graph import Weight, ops, DeviceRef

class Linear(nn.Module):
    def __init__(self, in_dims, out_dims):
        super().__init__()
        self.weight = Weight("weight", DType.float32, (in_dims, out_dims), DeviceRef.CPU())

    def __call__(self, x):
        return x @ self.weight

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = Linear(5, 10)
        self.gate = Linear(5, 10)
        self.down = Linear(10, 5)

    def __call__(self, x):
        return self.down(ops.silu(self.gate(x)) + self.up(x))

model = MLP()
print(model.state_dict())  # {"up.weight": Buffer([5, 10]), ...}
```
Constructing a graph without [`Module`](#max.nn.legacy.layer.Module) can result in name collisions
with the weights (in this example, there would be three weights with the
name `weight`). With [`Module`](#max.nn.legacy.layer.Module), you can use [`state_dict()`](#max.nn.legacy.layer.Module.state_dict) or
[`load_state_dict()`](#max.nn.legacy.layer.Module.load_state_dict) to initialize or set the weights values, and finalize
the weight names to be unique within the model.
### `build_subgraph()` {#max.nn.legacy.layer.Module.build_subgraph}
> build\_subgraph(name, input\_types, weight\_prefix='')
Builds a subgraph for this module.
This method creates a subgraph that encapsulates the module’s logic,
handling input types, weights, and creating a graph with the module’s
computation.
Once the subgraph is built, it can be called using the `ops.call`
op.
**Parameters:**
* name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The name of the subgraph to create.
* input\_types ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[Type](../../graph/type.md#max.graph.type.Type)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Type](../../graph/type.md#max.graph.type.Type)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]) – A list of input types for the subgraph. Each element can be
either a single `Type` or a list of `Type` objects.
* weight\_prefix ([str](https://docs.python.org/3/library/stdtypes.html#str)) – Optional prefix for weight names in the subgraph. If provided,
weights with names starting with this prefix will have their names
modified by removing the prefix and will be marked as placeholders.
**Returns:**
The created subgraph containing the module’s computation.
**Return type:**
`Graph`
:::note Note
* Placeholder weights will require the `prefix` attribute of `ops.call` to be set.
:::
### `layer_weights` {#max.nn.legacy.layer.Module.layer_weights}
> property layer\_weights: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Weight](../../graph/Weight.md#max.graph.Weight)]
### `load_state_dict()` {#max.nn.legacy.layer.Module.load_state_dict}
> load\_state\_dict(state\_dict, \*, override\_quantization\_encoding=False, weight\_alignment=None, strict=True)
Sets the values of all weights in this model.
**Parameters:**
* state\_dict ([Mapping](https://docs.python.org/3/library/collections.abc.html#collections.abc.Mapping)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [DLPackArray](../../driver.md#max.driver.DLPackArray) | [WeightData](../../graph/weights.md#max.graph.weights.WeightData)]) – A map from weight name to a numpy array or
[`max.driver.Buffer`](../../driver.md#max.driver.Buffer).
* override\_quantization\_encoding ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to override the weight
quantization based on the loaded value.
* weight\_alignment ([int](https://docs.python.org/3/library/functions.html#int) | None) – If specified, overrides the alignment for each
weight in the Module. If left as None, each value in
state\_dict must be aligned by the default dtype alignment.
* strict ([bool](https://docs.python.org/3/library/functions.html#bool)) – If True, raises an error if any weights required by the
Module are missing from state\_dict, or if any keys in
state\_dict were not used by the Module. If False, both
missing and unexpected keys are tolerated and reported only
via return values/logging by callers.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If strict is True and any required weight is missing
from state\_dict, or if state\_dict contains keys not used by
the Module.
**Return type:**
None
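The strict-mode key checking follows the usual state-dict convention; a pure-Python sketch of the set logic (`check_strict` is illustrative, not the MAX implementation):

```python
def check_strict(required: set[str], provided: set[str]) -> None:
    # strict=True semantics: every required weight must be provided,
    # and no unused keys may remain in the state dict.
    missing = required - provided
    unexpected = provided - required
    if missing or unexpected:
        raise ValueError(
            f"missing keys: {sorted(missing)}, "
            f"unexpected keys: {sorted(unexpected)}"
        )

check_strict({"up.weight", "down.weight"}, {"up.weight", "down.weight"})  # ok
```

With `strict=False`, both missing and unexpected keys would be tolerated rather than raising.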
### `raw_state_dict()` {#max.nn.legacy.layer.Module.raw_state_dict}
> raw\_state\_dict()
Returns all weights objects in the model.
Unlike [`state_dict`](#max.nn.legacy.layer.Module.state_dict), this returns [`max.graph.Weight`](../../graph/Weight.md#max.graph.Weight) objects instead of
the assigned values. Some parameters inside the `Weight` can be
configured before a graph is built. Do not change these attributes after
building a graph:
* [`align`](../../graph/Weight.md#max.graph.Weight.align)
* [`dtype`](../../graph/Weight.md#max.graph.Weight.dtype)
* [`quantization_encoding`](../../graph/Weight.md#max.graph.Weight.quantization_encoding)
* [`shape`](../../graph/Weight.md#max.graph.Weight.shape)
**Returns:**
Map from weight name to the [`max.graph.Weight`](../../graph/Weight.md#max.graph.Weight) object.
### `state_dict()` {#max.nn.legacy.layer.Module.state_dict}
> state\_dict(auto\_initialize=True)
Returns values of all weights in the model.
The values returned are the same as the values set in [`load_state_dict`](#max.nn.legacy.layer.Module.load_state_dict).
If [`load_state_dict`](#max.nn.legacy.layer.Module.load_state_dict) has not been called and none of the weights have
values, then they are initialized to zero.
**Parameters:**
auto\_initialize ([bool](https://docs.python.org/3/library/functions.html#bool)) – Determines whether to initialize weights to zero if
the weight value has not been loaded. If this is False, a
ValueError is raised if an uninitialized weight is found.
**Returns:**
Map from weight name to the weight value (can be numpy array or
[`max.driver.Buffer`](../../driver.md#max.driver.Buffer)).
### `sublayers` {#max.nn.legacy.layer.Module.sublayers}
> property sublayers: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Module](#max.nn.legacy.layer.Module)]
## `Shardable` {#max.nn.legacy.layer.Shardable}
> class max.nn.legacy.layer.Shardable(\*args, \*\*kwargs)
Protocol for objects that support sharding across multiple devices.
This protocol defines the interface that all shardable components
(like Linear layers and Weight objects) must implement to participate
in distributed computation.
### `shard()` {#max.nn.legacy.layer.Shardable.shard}
> shard(devices)
Creates a sharded view of this object for a specific device.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](../../graph/type.md#max.graph.type.DeviceRef)]) – The devices where the shards should reside.
### `sharding_strategy` {#max.nn.legacy.layer.Shardable.sharding_strategy}
> property sharding\_strategy: ShardingStrategy | [None](https://docs.python.org/3/library/constants.html#None)
Gets the weight sharding strategy.
## `add_layer_hook()` {#max.nn.legacy.layer.add_layer_hook}
> max.nn.legacy.layer.add\_layer\_hook(fn)
Adds a hook to call a function after each layer’s `__call__`.
The function will be passed four inputs:
* layer
* input\_args
* input\_kwargs
* outputs
The function can either return None or new outputs that will replace the
layer's returned outputs.
Note that inputs and outputs contain graph Values, which show limited
information (like [`shape`](../../graph/TensorValue.md#max.graph.TensorValue.shape) and [`dtype`](../../graph/TensorValue.md#max.graph.TensorValue.dtype)). You can still see the computed values
if you include the Value in the `graph.ops.output` op, or call `graph.ops.print`.
Example of printing debug inputs:
```python
def print_info(layer, args, kwargs, outputs):
print("Layer:", type(layer).__name__)
print("Input args:", args)
print("Input kwargs:", kwargs)
print("Outputs:", outputs)
return outputs
add_layer_hook(print_info)
```
## `clear_hooks()` {#max.nn.legacy.layer.clear_hooks}
> max.nn.legacy.layer.clear\_hooks()
Remove all hooks.
**Return type:**
None
## `recursive_named_layers()` {#max.nn.legacy.layer.recursive_named_layers}
> max.nn.legacy.layer.recursive\_named\_layers(parent, prefix='')
Recursively walks through the layers and generates names.
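The traversal can be pictured with a minimal stand-in for `Module` (the `Node` class below is hypothetical; only the dotted-name scheme mirrors this function):

```python
from typing import Iterator

class Node:
    # Minimal stand-in for a Module with a `sublayers` mapping.
    def __init__(self, **sublayers: "Node") -> None:
        self.sublayers = sublayers

def recursive_named_layers(parent: Node, prefix: str = "") -> Iterator[tuple[str, Node]]:
    # Yield (dotted_name, layer) pairs, depth-first, parent before children.
    yield prefix, parent
    for name, child in parent.sublayers.items():
        child_prefix = f"{prefix}.{name}" if prefix else name
        yield from recursive_named_layers(child, child_prefix)

model = Node(up=Node(), gate=Node(), down=Node(proj=Node()))
print([name for name, _ in recursive_named_layers(model)])
# ['', 'up', 'gate', 'down', 'down.proj']
```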
## `DistributedGemmConfig` {#max.nn.legacy.linear.DistributedGemmConfig}
> class max.nn.legacy.linear.DistributedGemmConfig(enable\_matmul\_allreduce)
Configuration for how distributed General Matrix Multiply (GEMM) operations
are executed.
## `Linear` {#max.nn.legacy.linear.Linear}
> class max.nn.legacy.linear.Linear(in\_dim, out\_dim, dtype, device, has\_bias=False, quantization\_encoding=None, float8\_config=None, name=None, clip\_weight=None, is\_sharding=False)
Applies a linear transformation to incoming data: $y = xW^T + b$.
This layer implements a fully connected layer where inputs are multiplied
by a weight matrix and optionally added with a bias vector.
Both weights and bias initially reside on CPU, and the model init phase
moves them to the specified device.
Example:
```python
linear_layer = Linear(
in_dim=256,
out_dim=128,
dtype=DType.float32,
device=DeviceRef.GPU(),
name="linear",
has_bias=True
)
input_tensor: TensorValue
output = linear_layer(input_tensor)
```
### `bias` {#max.nn.legacy.linear.Linear.bias}
> bias: [Weight](../../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None) = None
The optional bias vector stored on CPU with shape (out\_dim,).
Model init moves the bias to the target device if present.
### `device` {#max.nn.legacy.linear.Linear.device}
> device: [DeviceRef](../../graph/ops.md#max.graph.ops.DeviceRef)
The device where matrix operations are performed.
### `input_scale` {#max.nn.legacy.linear.Linear.input_scale}
> input\_scale: [Weight](../../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None) = None
The optional input scale stored on CPU with shape ().
Model init moves the input\_scale to the target device if present.
### `shard()` {#max.nn.legacy.linear.Linear.shard}
> shard(devices)
Creates sharded views of this Linear layer across multiple devices.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](../../graph/type.md#max.graph.type.DeviceRef)]) – Iterable of `DeviceRef` devices to place the shards on.
**Returns:**
List of sharded [`Linear`](#max.nn.legacy.linear.Linear) instances, one for each device.
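As a rough picture of what a shard is, one common strategy splits the `(out_dim, in_dim)` weight along `out_dim`, one slice per device. This is a hand-rolled sketch of that idea, not the MAX sharding machinery:

```python
def shard_rowwise(weight, n_devices):
    # Split an (out_dim, in_dim) weight along out_dim into equal shards,
    # one per device (assumes out_dim is divisible by n_devices).
    out_dim = len(weight)
    assert out_dim % n_devices == 0, "out_dim must divide evenly"
    rows = out_dim // n_devices
    return [weight[i * rows:(i + 1) * rows] for i in range(n_devices)]

w = [[1, 2], [3, 4], [5, 6], [7, 8]]  # out_dim=4, in_dim=2
shards = shard_rowwise(w, 2)
print(shards[0])  # [[1, 2], [3, 4]]
```

Each device then computes a partial output against its slice, and the results are concatenated (or reduced, depending on the strategy).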
### `sharding_strategy` {#max.nn.legacy.linear.Linear.sharding_strategy}
> property sharding\_strategy: ShardingStrategy | [None](https://docs.python.org/3/library/constants.html#None)
Get the weight sharding strategy.
### `weight` {#max.nn.legacy.linear.Linear.weight}
> weight: [Weight](../../graph/Weight.md#max.graph.Weight)
The weight matrix stored on CPU with shape (out\_dim, in\_dim).
Model init transposes the weight and moves it to the target device.
### `weight_scale` {#max.nn.legacy.linear.Linear.weight_scale}
> weight\_scale: [Weight](../../graph/Weight.md#max.graph.Weight) | [None](https://docs.python.org/3/library/constants.html#None) = None
The optional weight scale stored on CPU with shape () or (N,).
Model init moves the weight\_scale to the target device if present.
## `MLP` {#max.nn.legacy.linear.MLP}
> class max.nn.legacy.linear.MLP(dtype, quantization\_encoding, hidden\_dim, feed\_forward\_length, devices, linear\_cls=Linear, has\_bias=False, activation\_function='silu', float8\_config=None, dist\_gemm\_config=None, is\_sharding=False)
Simple multi-layer perceptron composed of three [`Linear`](#max.nn.legacy.linear.Linear) layers.
Defaults to SiLU activation function.
### `shard()` {#max.nn.legacy.linear.MLP.shard}
> shard(devices)
Creates sharded views of this MLP across multiple devices.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](../../graph/type.md#max.graph.type.DeviceRef)]) – Iterable of devices to place the shards on.
**Returns:**
List of sharded MLP instances, one for each device.
### `down_proj` {#max.nn.legacy.moe.MoE.down_proj}
> property down\_proj: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)
### `ep_batch_manager` {#max.nn.legacy.moe.MoE.ep_batch_manager}
> property ep\_batch\_manager: EPBatchManager
Get the expert parallel batch manager.
### `experts` {#max.nn.legacy.moe.MoE.experts}
> experts: [LayerList](layer.md#max.nn.legacy.layer.LayerList)
The list of experts.
### `gate_up_proj` {#max.nn.legacy.moe.MoE.gate_up_proj}
> property gate\_up\_proj: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)
### `shard()` {#max.nn.legacy.moe.MoE.shard}
> shard(devices)
Create sharded views of this MoE module across multiple devices.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](../../graph/type.md#max.graph.type.DeviceRef)]) – Iterable of devices to place the shards on.
**Returns:**
List of sharded MoE instances, one for each device.
### `shard_devices` {#max.nn.legacy.moe.MoE.shard_devices}
> shard\_devices: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[DeviceRef](../../graph/ops.md#max.graph.ops.DeviceRef)] = []
The list of devices the MoE layer was sharded to.
### `shard_index` {#max.nn.legacy.moe.MoE.shard_index}
> shard\_index: [int](https://docs.python.org/3/library/functions.html#int) = 0
The index of the current shard (if the MoE layer was sharded).
### `sharding_strategy` {#max.nn.legacy.moe.MoE.sharding_strategy}
> property sharding\_strategy: ShardingStrategy | [None](https://docs.python.org/3/library/constants.html#None)
Get the sharding strategy for the module.
## `MoEGate` {#max.nn.legacy.moe.MoEGate}
> class max.nn.legacy.moe.MoEGate(devices, hidden\_dim, num\_experts, num\_experts\_per\_token, dtype, is\_sharding=False, linear\_cls=Linear)
Gate module for MoE.
### `shard()` {#max.nn.legacy.moe.MoEGate.shard}
> shard(devices)
Create sharded views of this MoEGate module across multiple devices.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](../../graph/type.md#max.graph.type.DeviceRef)]) – Iterable of devices to place the shards on.
**Returns:**
List of sharded MoEGate instances, one for each device.
### `fused_silu_quantize()` {#max.nn.legacy.moe.Nvfp4Strategy.fused_silu_quantize}
> fused\_silu\_quantize(gate\_up\_projs, input\_scales=None, expert\_inputs=())
Applies SiLU gate then NVFP4 quantizes the result.
## `QuantStrategy` {#max.nn.legacy.moe.QuantStrategy}
> class max.nn.legacy.moe.QuantStrategy(\*args, \*\*kwargs)
Quantization strategy for MoE layers.
### `fused_silu_quantize()` {#max.nn.legacy.moe.QuantStrategy.fused_silu_quantize}
> fused\_silu\_quantize(gate\_up\_projs, input\_scales=None, expert\_inputs=())
Applies gating and quantizes activations for the down proj.
### `beta` {#max.nn.legacy.norm.ConstantLayerNorm.beta}
> beta: npt.NDArray\[np.floating\[Any]]
### `device` {#max.nn.legacy.norm.ConstantLayerNorm.device}
> device: [DeviceRef](../../graph/ops.md#max.graph.ops.DeviceRef)
### `dtype` {#max.nn.legacy.norm.ConstantLayerNorm.dtype}
> dtype: [DType](../../dtype.md#max.dtype.DType)
### `eps` {#max.nn.legacy.norm.ConstantLayerNorm.eps}
> eps: [float](https://docs.python.org/3/library/functions.html#float) = 1e-05
### `gamma` {#max.nn.legacy.norm.ConstantLayerNorm.gamma}
> gamma: npt.NDArray\[np.floating\[Any]]
## `GroupNorm` {#max.nn.legacy.norm.GroupNorm}
> class max.nn.legacy.norm.GroupNorm(num\_groups, num\_channels, eps=1e-05, affine=True, device=gpu:0)
Group normalization block.
Divides channels into groups and computes normalization stats per group.
Follows the implementation pattern from PyTorch’s group\_norm.
**Parameters:**
* num\_groups ([int](https://docs.python.org/3/library/functions.html#int)) – Number of groups to separate the channels into
* num\_channels ([int](https://docs.python.org/3/library/functions.html#int)) – Number of input channels
* eps ([float](https://docs.python.org/3/library/functions.html#float)) – Small constant added to denominator for numerical stability
* affine ([bool](https://docs.python.org/3/library/functions.html#bool)) – If True, apply learnable affine transform parameters
* device ([DeviceRef](../../graph/ops.md#max.graph.ops.DeviceRef))
### `shard()` {#max.nn.legacy.norm.LayerNorm.shard}
> shard(devices)
Creates sharded views of this LayerNorm across multiple devices.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](../../graph/type.md#max.graph.type.DeviceRef)]) – Iterable of devices to place the shards on.
**Returns:**
List of sharded LayerNorm instances, one for each device.
### `sharding_strategy` {#max.nn.legacy.norm.LayerNorm.sharding_strategy}
> property sharding\_strategy: ShardingStrategy | [None](https://docs.python.org/3/library/constants.html#None)
Get the LayerNorm sharding strategy.
## `RMSNorm` {#max.nn.legacy.norm.RMSNorm}
> class max.nn.legacy.norm.RMSNorm(dim, dtype, eps=1e-06, weight\_offset=0.0, multiply\_before\_cast=True)
Computes the Root Mean Square normalization on inputs.
**Parameters:**
* dim ([int](https://docs.python.org/3/library/functions.html#int)) – Size of last dimension of the expected input.
* eps ([float](https://docs.python.org/3/library/functions.html#float)) – Value added to denominator for numerical stability.
* weight\_offset ([float](https://docs.python.org/3/library/functions.html#float)) – Constant offset added to the learned weights at runtime.
For Gemma-style RMSNorm, this should be set to 1.0.
* multiply\_before\_cast ([bool](https://docs.python.org/3/library/functions.html#bool)) – True if we multiply the inputs by the learned
weights before casting to the input type (Gemma3-style). False if we
cast the inputs to the input type first, then multiply by the learned
weights (Llama-style).
* dtype ([DType](../../dtype.md#max.dtype.DType))
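The computation can be written out as `y_i = x_i / sqrt(mean(x^2) + eps) * (w_i + weight_offset)`; a plain-Python reference sketch (illustrative, ignoring the dtype-cast ordering controlled by `multiply_before_cast`):

```python
import math

def rms_norm(x, weight, eps=1e-6, weight_offset=0.0):
    # y_i = x_i / sqrt(mean(x^2) + eps) * (w_i + weight_offset)
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (w + weight_offset) for v, w in zip(x, weight)]

out = rms_norm([3.0, 4.0], [1.0, 1.0])
print(out)  # ≈ [0.849, 1.131], since rms([3, 4]) = sqrt(12.5) ≈ 3.536
```

For Gemma-style checkpoints, the stored weights are offsets from 1.0, hence `weight_offset=1.0`.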
### `shard()` {#max.nn.legacy.norm.RMSNorm.shard}
> shard(devices)
Creates sharded views of this RMSNorm across multiple devices.
**Parameters:**
devices ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[DeviceRef](../../graph/type.md#max.graph.type.DeviceRef)]) – Iterable of devices to place the shards on.
**Returns:**
List of sharded RMSNorm instances, one for each device.
### `beta_fast` {#max.nn.legacy.rotary_embedding.DeepseekYarnRopeScalingParams.beta_fast}
> beta\_fast: [int](https://docs.python.org/3/library/functions.html#int)
Fast interpolation rate.
### `beta_slow` {#max.nn.legacy.rotary_embedding.DeepseekYarnRopeScalingParams.beta_slow}
> beta\_slow: [int](https://docs.python.org/3/library/functions.html#int)
Slow interpolation rate.
### `mscale` {#max.nn.legacy.rotary_embedding.DeepseekYarnRopeScalingParams.mscale}
> mscale: [float](https://docs.python.org/3/library/functions.html#float)
Scaling factor for middle frequencies.
### `mscale_all_dim` {#max.nn.legacy.rotary_embedding.DeepseekYarnRopeScalingParams.mscale_all_dim}
> mscale\_all\_dim: [float](https://docs.python.org/3/library/functions.html#float)
Scaling factor applied to all dimensions.
### `original_max_position_embeddings` {#max.nn.legacy.rotary_embedding.DeepseekYarnRopeScalingParams.original_max_position_embeddings}
> original\_max\_position\_embeddings: [int](https://docs.python.org/3/library/functions.html#int)
Original maximum sequence length during training.
### `scaling_factor` {#max.nn.legacy.rotary_embedding.DeepseekYarnRopeScalingParams.scaling_factor}
> scaling\_factor: [float](https://docs.python.org/3/library/functions.html#float)
Scaling factor for frequency interpolation.
## `DeepseekYarnRotaryEmbedding` {#max.nn.legacy.rotary_embedding.DeepseekYarnRotaryEmbedding}
> class max.nn.legacy.rotary\_embedding.DeepseekYarnRotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, head\_dim=None, \_freqs\_cis=None, interleaved=True, scaling\_params=None)
Deepseek’s YaRN (Yet another RoPE extensioN) Rotary Position Embedding layer.
Unlike Llama3RotaryEmbedding, the dim argument here is the rope dimension
of the model, not the hidden dimension.
### `freqs_cis_base()` {#max.nn.legacy.rotary_embedding.DeepseekYarnRotaryEmbedding.freqs_cis_base}
> freqs\_cis\_base()
Computes the frequency tensor for complex exponentials (cis)
for a given seq\_len. The tensor is scaled with the theta parameter.
Required to apply Rotary Position Embedding (RoPE) to a tensor.
See ‘RoFormer: Enhanced Transformer with Rotary Position Embedding’
(arxiv.org/pdf/2104.09864).
**Returns:**
The frequency tensor for complex exponentials with shape
(max\_seq\_len, rope\_dim // 2, 2)
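The unscaled table can be sketched in plain Python (illustrative; the real layer builds this as a graph tensor, and interleaving/layout details are omitted):

```python
import math

def freqs_cis_base(dim: int, max_seq_len: int, theta: float = 10000.0):
    # Per-pair inverse frequencies: theta^(-2i/dim) for i in [0, dim/2).
    inv_freq = [theta ** (-(2 * i) / dim) for i in range(dim // 2)]
    # Table of (cos, sin) pairs: shape (max_seq_len, dim // 2, 2).
    return [
        [[math.cos(pos * f), math.sin(pos * f)] for f in inv_freq]
        for pos in range(max_seq_len)
    ]

table = freqs_cis_base(dim=8, max_seq_len=4)
print(len(table), len(table[0]), len(table[0][0]))  # 4 4 2
print(table[0][0])  # position 0: [1.0, 0.0]
```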
### `scaling_params` {#max.nn.legacy.rotary_embedding.DeepseekYarnRotaryEmbedding.scaling_params}
> scaling\_params: [DeepseekYarnRopeScalingParams](#max.nn.legacy.rotary_embedding.DeepseekYarnRopeScalingParams) | [None](https://docs.python.org/3/library/constants.html#None) = None
## `DynamicRotaryEmbedding` {#max.nn.legacy.rotary_embedding.DynamicRotaryEmbedding}
> class max.nn.legacy.rotary\_embedding.DynamicRotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, head\_dim=None, \_freqs\_cis=None, interleaved=True)
RotaryEmbedding with dynamic scaling support for long-context inference.
Dynamically updates the inv\_freq and corresponding freqs\_cis buffer if the
current sequence length exceeds the original max, or resets to the original
high-precision version for short sequences.
### `freqs_cis_base()` {#max.nn.legacy.rotary_embedding.DynamicRotaryEmbedding.freqs_cis_base}
> freqs\_cis\_base()
Computes freqs\_cis dynamically using the current self.inv\_freq.
### `maybe_update_freqs()` {#max.nn.legacy.rotary_embedding.DynamicRotaryEmbedding.maybe_update_freqs}
> maybe\_update\_freqs(position\_ids)
Update freqs\_cis if the sequence exceeds max\_seq\_len\_cached, or revert
to the original version if back below the threshold.
### `factor` {#max.nn.legacy.rotary_embedding.LinearScalingParams.factor}
> factor: [float](https://docs.python.org/3/library/functions.html#float)
Main scaling factor for the frequency components of the rope.
## `Llama3RopeScalingParams` {#max.nn.legacy.rotary_embedding.Llama3RopeScalingParams}
> class max.nn.legacy.rotary\_embedding.Llama3RopeScalingParams(factor: [float](https://docs.python.org/3/library/functions.html#float), low\_freq\_factor: [float](https://docs.python.org/3/library/functions.html#float), high\_freq\_factor: [float](https://docs.python.org/3/library/functions.html#float), orig\_max\_position: [int](https://docs.python.org/3/library/functions.html#int))
### `factor` {#max.nn.legacy.rotary_embedding.Llama3RopeScalingParams.factor}
> factor: [float](https://docs.python.org/3/library/functions.html#float)
Main scaling factor for the frequency components of the rope.
### `high_freq_factor` {#max.nn.legacy.rotary_embedding.Llama3RopeScalingParams.high_freq_factor}
> high\_freq\_factor: [float](https://docs.python.org/3/library/functions.html#float)
Factor to scale the high frequency components of the rope.
### `low_freq_factor` {#max.nn.legacy.rotary_embedding.Llama3RopeScalingParams.low_freq_factor}
> low\_freq\_factor: [float](https://docs.python.org/3/library/functions.html#float)
Factor to scale the low frequency components of the rope.
### `orig_max_position` {#max.nn.legacy.rotary_embedding.Llama3RopeScalingParams.orig_max_position}
> orig\_max\_position: [int](https://docs.python.org/3/library/functions.html#int)
The original maximum position length supported by the model.
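As a sketch of how these four parameters combine, the widely used Llama 3 scaling rule rescales each inverse frequency by its wavelength class. This mirrors the common open-source formulation and is not necessarily MAX's exact code:

```python
import math

def llama3_scale_freq(freq, factor, low_freq_factor, high_freq_factor, orig_max_position):
    # Classify each frequency by its wavelength relative to the original
    # training context window.
    wavelen = 2 * math.pi / freq
    low_freq_wavelen = orig_max_position / low_freq_factor
    high_freq_wavelen = orig_max_position / high_freq_factor
    if wavelen < high_freq_wavelen:
        return freq           # high-frequency components: unchanged
    if wavelen > low_freq_wavelen:
        return freq / factor  # low-frequency components: fully scaled
    # In between: smooth interpolation between the two regimes.
    smooth = (orig_max_position / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor
    )
    return (1 - smooth) * freq / factor + smooth * freq
```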
## `Llama3RotaryEmbedding` {#max.nn.legacy.rotary_embedding.Llama3RotaryEmbedding}
> class max.nn.legacy.rotary\_embedding.Llama3RotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, head\_dim=None, \_freqs\_cis=None, interleaved=True, scaling\_params=None)
RotaryEmbedding for Llama3 that takes rope scaling into account.
### `scaling_params` {#max.nn.legacy.rotary_embedding.Llama3RotaryEmbedding.scaling_params}
> scaling\_params: [Llama3RopeScalingParams](#max.nn.legacy.rotary_embedding.Llama3RopeScalingParams) | [None](https://docs.python.org/3/library/constants.html#None) = None
Scaling parameters to enable llama to function with a longer context length.
## `LongRoPERotaryEmbedding` {#max.nn.legacy.rotary_embedding.LongRoPERotaryEmbedding}
> class max.nn.legacy.rotary\_embedding.LongRoPERotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, head\_dim=None, \_freqs\_cis=None, interleaved=True, scaling\_params=None)
Rotary position embedding with LongRoPE scaling for Phi-3.5 models.
### `freqs_cis_base()` {#max.nn.legacy.rotary_embedding.LongRoPERotaryEmbedding.freqs_cis_base}
> freqs\_cis\_base()
Computes the frequency tensor for complex exponentials (cis)
with LongRoPE scaling. Creates a “stitched” table where:
* Positions 0 to original\_max\_position use short\_factor
* Positions from original\_max\_position onwards use long\_factor
**Returns:**
The frequency tensor for complex exponentials with shape (max\_seq\_len \* 2, head\_dim / 2, 2)
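The stitching described above can be sketched as follows (an illustrative simplification, not the MAX implementation: it returns raw angles rather than the (cos, sin) table, and `stitched_freqs` is a hypothetical name):

```python
# Per-dimension inverse frequencies 1/theta^(2i/head_dim), divided by
# short_factor[i] for positions below original_max_position and by
# long_factor[i] from original_max_position onwards.
def stitched_freqs(head_dim, theta, max_seq_len, original_max_position,
                   short_factor, long_factor):
    half = head_dim // 2
    inv_freq = [theta ** (-2 * i / head_dim) for i in range(half)]
    table = []
    for pos in range(max_seq_len):
        factors = short_factor if pos < original_max_position else long_factor
        table.append([pos * f_inv / f for f_inv, f in zip(inv_freq, factors)])
    return table  # angles; the real table stores (cos, sin) pairs
```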
## `LongRoPEScalingParams` {#max.nn.legacy.rotary_embedding.LongRoPEScalingParams}
> class max.nn.legacy.rotary\_embedding.LongRoPEScalingParams(short\_factor, long\_factor, original\_max\_position, max\_position\_embeddings)
Parameters for LongRoPE scaling as used in Phi-3.5 models.
### `long_factor` {#max.nn.legacy.rotary_embedding.LongRoPEScalingParams.long_factor}
> long\_factor: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[float](https://docs.python.org/3/library/functions.html#float)]
Scaling factors for long sequences (can be much larger).
### `max_position_embeddings` {#max.nn.legacy.rotary_embedding.LongRoPEScalingParams.max_position_embeddings}
> max\_position\_embeddings: [int](https://docs.python.org/3/library/functions.html#int)
Current max position embeddings after scaling.
### `original_max_position` {#max.nn.legacy.rotary_embedding.LongRoPEScalingParams.original_max_position}
> original\_max\_position: [int](https://docs.python.org/3/library/functions.html#int)
Original max position embeddings the model was trained with.
### `short_factor` {#max.nn.legacy.rotary_embedding.LongRoPEScalingParams.short_factor}
> short\_factor: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[float](https://docs.python.org/3/library/functions.html#float)]
Scaling factors for short sequences (typically close to 1.0).
## `RotaryEmbedding` {#max.nn.legacy.rotary_embedding.RotaryEmbedding}
> class max.nn.legacy.rotary\_embedding.RotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, head\_dim=None, \_freqs\_cis=None, interleaved=True)
RotaryEmbedding layer to calculate and apply the frequency tensor for complex exponentials.
### `dim` {#max.nn.legacy.rotary_embedding.RotaryEmbedding.dim}
> dim: [int](https://docs.python.org/3/library/functions.html#int)
### `freqs_cis` {#max.nn.legacy.rotary_embedding.RotaryEmbedding.freqs_cis}
> property freqs\_cis: [TensorValue](../../graph/TensorValue.md#max.graph.TensorValue)
### `freqs_cis_base()` {#max.nn.legacy.rotary_embedding.RotaryEmbedding.freqs_cis_base}
> freqs\_cis\_base()
Computes the frequency tensor for complex exponentials (cis)
for a given seq\_len. Tensor is scaled with theta parameter.
Required to apply Rotary Position Embedding (RoPE) to tensor.
See ‘Roformer: Enhanced Transformer with Rotary Embedding’
(arxiv.org/pdf/2104.09864).
**Returns:**
The frequency tensor for complex exponentials with shape (max\_seq\_len \* 2, head\_dim / 2, 2)
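The unscaled table from the Roformer paper can be sketched in plain Python (a minimal illustration of the formula, not the MAX implementation; the real table stores the pairs in a tensor of shape (max\_seq\_len \* 2, head\_dim / 2, 2)):

```python
import math

# angle(pos, i) = pos / theta^(2i/head_dim); each entry is the (cos, sin)
# pair for one position and one pair index.
def freqs_cis(max_seq_len, head_dim, theta=10000.0):
    half = head_dim // 2
    inv_freq = [theta ** (-2 * i / head_dim) for i in range(half)]
    return [[(math.cos(p * f), math.sin(p * f)) for f in inv_freq]
            for p in range(max_seq_len)]
```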
### `head_dim` {#max.nn.legacy.rotary_embedding.RotaryEmbedding.head_dim}
> head\_dim: [int](https://docs.python.org/3/library/functions.html#int)
head\_dim = dim // n\_heads if not specified in the config.
### `interleaved` {#max.nn.legacy.rotary_embedding.RotaryEmbedding.interleaved}
> interleaved: [bool](https://docs.python.org/3/library/functions.html#bool) = True
### `max_seq_len` {#max.nn.legacy.rotary_embedding.RotaryEmbedding.max_seq_len}
> max\_seq\_len: [int](https://docs.python.org/3/library/functions.html#int)
The maximum sequence length of the model’s input.
### `n_heads` {#max.nn.legacy.rotary_embedding.RotaryEmbedding.n_heads}
> n\_heads: [int](https://docs.python.org/3/library/functions.html#int)
### `theta` {#max.nn.legacy.rotary_embedding.RotaryEmbedding.theta}
> theta: [float](https://docs.python.org/3/library/functions.html#float)
Hyperparameter used to control the frequency scaling of the sinusoidal components of the embeddings.
## `YarnRotaryEmbedding` {#max.nn.legacy.rotary_embedding.YarnRotaryEmbedding}
> class max.nn.legacy.rotary\_embedding.YarnRotaryEmbedding(dim, n\_heads, theta, max\_seq\_len, head\_dim=None, \_freqs\_cis=None, interleaved=True, scaling\_params=None)
Generic YaRN (Yet another RoPE eNhancement) Rotary Position Embedding layer.
This implementation provides YARN scaling for models that require it,
with configurable parameters for beta\_fast, beta\_slow, and scaling factor.
### `freqs_cis_base()` {#max.nn.legacy.rotary_embedding.YarnRotaryEmbedding.freqs_cis_base}
> freqs\_cis\_base()
Computes the frequency tensor for complex exponentials (cis)
with YARN scaling applied.
## `YarnScalingParams` {#max.nn.legacy.rotary_embedding.YarnScalingParams}
Parameters for YaRN scaling of rotary embeddings.
### `beta_fast` {#max.nn.legacy.rotary_embedding.YarnScalingParams.beta_fast}
> beta\_fast: [float](https://docs.python.org/3/library/functions.html#float)
YaRN parameter for fast frequencies.
### `beta_slow` {#max.nn.legacy.rotary_embedding.YarnScalingParams.beta_slow}
> beta\_slow: [float](https://docs.python.org/3/library/functions.html#float)
YaRN parameter for slow frequencies.
### `factor` {#max.nn.legacy.rotary_embedding.YarnScalingParams.factor}
> factor: [float](https://docs.python.org/3/library/functions.html#float)
Main scaling factor for the frequency components of RoPE.
### `original_max_position_embeddings` {#max.nn.legacy.rotary_embedding.YarnScalingParams.original_max_position_embeddings}
> original\_max\_position\_embeddings: [int](https://docs.python.org/3/library/functions.html#int)
The original maximum position length supported by the model.
### `truncate` {#max.nn.legacy.rotary_embedding.YarnScalingParams.truncate}
> truncate: [bool](https://docs.python.org/3/library/functions.html#bool)
Whether to truncate the frequencies or not.
---
## sampling
Sampling custom ops.
## `MinPSampler` {#max.nn.legacy.sampling.MinPSampler}
> class max.nn.legacy.sampling.MinPSampler(dtype, shape, temperature=1)
A min\_p sampler.
**Parameters:**
* dtype ([DType](../../dtype.md#max.dtype.DType))
* shape ([Shape](../../graph/shape.md#max.graph.shape.Shape))
* temperature ([float](https://docs.python.org/3/library/functions.html#float))
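The min\_p filtering rule can be sketched in plain Python (a hedged illustration of the sampling idea only; it omits the temperature scaling and is not the MAX custom op):

```python
import random

# Keep tokens whose probability is at least min_p times the top probability,
# then sample from the renormalized survivors.
def min_p_sample(probs, min_p, rng=random.random):
    threshold = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= threshold]
    total = sum(p for _, p in kept)
    r = rng() * total
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```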
---
## sequential
A general sequential layer, where each layer is executed with the outputs of the previous.
## `Sequential` {#max.nn.legacy.sequential.Sequential}
> class max.nn.legacy.sequential.Sequential(layers)
A sequential stack of layers where each layer is called with the outputs
of the previous layer.
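The chaining behavior amounts to a fold over the layer list, which can be sketched as (illustrative, not the MAX implementation):

```python
from functools import reduce

# Each layer consumes the previous layer's output; the first layer
# receives the original input.
def sequential(layers, x):
    return reduce(lambda out, layer: layer(out), layers, x)
```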
---
## distributed_transformer
## `take()` {#max.nn.legacy.transformer.distributed_transformer.take}
> max.nn.legacy.transformer.distributed\_transformer.take(it, n)
Return the next `n` items from `it` as a list.
**Parameters:**
* it ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[Value](../../../graph/Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]])
* n ([int](https://docs.python.org/3/library/functions.html#int))
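The documented behavior is equivalent to the standard-library `islice` idiom (a sketch of the behavior, not necessarily the actual implementation):

```python
from itertools import islice

def take(it, n):
    """Return the next n items from it as a list."""
    return list(islice(it, n))
```

Repeated calls on the same iterator consume it incrementally.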
---
## module
Base classes and decorators for building neural network modules in MAX.
## `Module` {#max.nn.module.Module}
> class max.nn.module.Module
The core unit of composition for modeling in MAX.
Informally, a `Module` is a container class. It can contain
other `Module` instances, tensors (the `Module`’s “local parameters”)
or other arbitrary Python data.
A `Module` also has a `forward()` method which defines how the `Module`
computes its output. In the simplest case this is a function from one tensor
to another tensor. Users call the module using `__call__()` which internally
invokes `forward()`.
Formally, modules form a tree, and subtrees of modules can be manipulated
directly. A `Module` may also be thought of as a closure, where the parameters
form the data of the closure and `forward()` is the application of the closure.
Users who do not use a Python type checker, or use lax settings for their
type checker, may inherit from `Module` without parameters. Users who use
a type checker with stricter settings (including MAX internal code) should
specify explicit types for full type checking:
```python
class Linear(Module[[Tensor], Tensor]):
    def forward(self, x: Tensor) -> Tensor:
        return x @ self.weight.T + self.bias
```
**Terminology:**
* A “child” of a `Module` is a sub-`Module` stored directly on that `Module`.
* A “descendant” of a `Module` is one of its children, or one of their
descendants.
* A “parameter” is a tensor storing data on the `Module` or one of its
descendants.
* The “qualified path” of a descendant is a period-separated string
of the names of the child module attributes which lead to that
descendant module, for instance `child.sub.last`.
* The “qualified path” of a parameter is the qualified path of the
descendant directly holding that parameter, followed by a final
path component for the attribute name of the tensor.
For instance `weight` for a local parameter, or
`child.sub.last.weight` for a descendant’s parameter.
```python
from max.tensor import Tensor
from max.nn import Module, module_dataclass

@module_dataclass
class Linear(Module):
    weight: Tensor
    bias: Tensor | int = 0

    def forward(self, x: Tensor) -> Tensor:
        return x @ self.weight.T + self.bias

linear = Linear(Tensor.zeros([5, 4]))
print(linear)
print(linear(Tensor.constant([1, 2, 3, 4])))
```
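The qualified-path convention described in the terminology above can be sketched with plain Python objects standing in for modules and tensors (all names here are hypothetical, not MAX API):

```python
# Walk an object tree, joining child attribute names with periods and ending
# with the attribute name of the "parameter" (a float stands in for Tensor).
def parameters(obj, prefix=""):
    for name, value in vars(obj).items():
        path = f"{prefix}{name}"
        if isinstance(value, float):          # stand-in for a Tensor parameter
            yield path, value
        elif hasattr(value, "__dict__"):      # stand-in for a child Module
            yield from parameters(value, f"{path}.")

class Last:
    def __init__(self): self.weight = 1.0
class Sub:
    def __init__(self): self.last = Last()
class Child:
    def __init__(self): self.sub = Sub()
class Root:
    def __init__(self): self.child = Child()
```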
### `apply_to_local_parameters()` {#max.nn.module.Module.apply_to_local_parameters}
> apply\_to\_local\_parameters(f)
Applies a transformation to each local parameter tensor on the `Module`.
The transformation is applied in-place, updating the module’s values.
It is not applied to descendants’ parameters.
For example:
```python
from max.driver import Accelerator
from max.nn import Linear
model = Linear(2, 3)
model.apply_to_local_parameters(lambda _, t: t.to(Accelerator()))
```
**Parameters:**
f ([Callable](../graph/ops.md#max.graph.ops.Callable)\[\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Tensor](../tensor.md#max.tensor.Tensor)], [Tensor](../tensor.md#max.tensor.Tensor)]) –
The transformation to apply to each local parameter.
The transformation takes two arguments, a name and a tensor:
* The name is the attribute name of the parameter on the module.
* The tensor is the current value of that parameter.
The return value of this function is the new value that will
replace the value at that name.
**Return type:**
None
### `apply_to_parameters()` {#max.nn.module.Module.apply_to_parameters}
> apply\_to\_parameters(f)
Applies a transformation to all parameters in the module hierarchy.
This method traverses the module tree and applies the transformation function
to each parameter in-place, updating both the current module’s parameters
and all nested sub-module parameters. The transformation receives the
parameter’s qualified name (dot-separated path) and current tensor value.
Transfer all parameters to accelerator:
```python
from max.driver import Accelerator
from max.tensor import Tensor
from max.nn import Module, module_dataclass, Linear

@module_dataclass
class MLP(Module):
    fc1: Linear
    fc2: Linear

    def forward(self, x: Tensor) -> Tensor:
        return self.fc2(self.fc1(x))

model = MLP(
    fc1=Linear(10, 20),
    fc2=Linear(20, 5),
)
model.apply_to_parameters(lambda name, t: t.to(Accelerator()))
```
**Parameters:**
f ([Callable](../graph/ops.md#max.graph.ops.Callable)\[\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Tensor](../tensor.md#max.tensor.Tensor)], [Tensor](../tensor.md#max.tensor.Tensor)]) –
Transformation function taking `(name, tensor)` and returning
the transformed tensor. Parameters:
* `name` ([`str`](https://docs.python.org/3/library/stdtypes.html#str)): Qualified dot-separated path of the parameter
(e.g., `"fc1.weight"`, `"encoder.layer2.bias"`)
* `tensor` (`Tensor`): Current value of the parameter
Returns the new tensor value to replace the parameter.
**Return type:**
None
### `children` {#max.nn.module.Module.children}
> property children: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Module](#max.nn.module.Module)\[..., [Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]
Iterates over the direct child modules of the `Module`.
**Yields:**
`(name, module)` pairs, where `name` is the attribute name of
the child on the module.
### `compile()` {#max.nn.module.Module.compile}
> compile(\*input\_types, weights=None)
Compiles the module to an optimized executable through graph tracing.
This method performs symbolic tracing of the module’s `forward` method
to construct a MAX `Graph`, which is then compiled and optimized for
efficient execution on CPU, GPU, or other accelerators.
The compilation process:
1. Creates symbolic `Tensor` instances based on provided type specifications
2. Executes `forward` with symbolic tensors to record operations
3. Constructs a `Graph` representing the computation
4. Includes all module parameters as weights in the graph
5. Compiles and optimizes the graph for target hardware
6. Returns an executable function with the same signature as `forward`
The input type specifications must match the signature of `forward`.
Use positional arguments for positional parameters.
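Steps 1 through 3 above (record operations by running `forward` on symbolic values, then replay them) can be sketched in miniature with scalar ops (every name here is invented for illustration; this is not how MAX traces graphs):

```python
# A symbolic value that records each op applied to it into a flat graph.
class Sym:
    def __init__(self, graph, name):
        self.graph, self.name = graph, name

    def _op(self, kind, other):
        out = Sym(self.graph, f"v{len(self.graph)}")
        rhs = other.name if isinstance(other, Sym) else other
        self.graph.append((out.name, kind, self.name, rhs))
        return out

    def __mul__(self, other): return self._op("mul", other)
    def __add__(self, other): return self._op("add", other)

def compile_fn(forward):
    graph = []
    result = forward(Sym(graph, "x"))   # run forward once on a symbol
    def run(x):                         # replay the recorded graph
        env = {"x": x}
        for out, kind, a, b in graph:
            rhs = env[b] if isinstance(b, str) else b
            env[out] = env[a] * rhs if kind == "mul" else env[a] + rhs
        return env[result.name]
    return run
```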
Basic compilation with fixed shapes:
```python
from max.dtype import DType
from max.tensor import Tensor, TensorType, defaults
from max.nn import Module, module_dataclass

@module_dataclass
class Linear(Module):
    weight: Tensor
    bias: Tensor

    def forward(self, x: Tensor) -> Tensor:
        return x @ self.weight.T + self.bias

linear = Linear(
    weight=Tensor.zeros([10, 5]),
    bias=Tensor.zeros([10]),
)

# Compile with fixed input shape
_, device = defaults()
input_type = TensorType(DType.float32, [3, 5], device=device)
model = linear.compile(input_type)

# Execute compiled model
input_data = Tensor.ones([3, 5], dtype=DType.float32)
result = model(input_data)
print(result)
```
**Parameters:**
* \*input\_types ([Type](../graph/type.md#max.graph.type.Type)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – Type specifications for each positional argument to
`forward`. Must match the number and order of arguments.
Each should be a `max.graph.Type` (typically
`TensorType`) describing the shape and dtype.
* weights ([Mapping](https://docs.python.org/3/library/collections.abc.html#collections.abc.Mapping)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [DLPackArray](../driver.md#max.driver.DLPackArray)] | None) – Mapping of parameter names to weight data. Weights should
be on CPU and will be transferred to the target device as part
of model initialization. If not passed, the model’s parameters
will be used as the weights.
**Returns:**
Callable\[…, Any]
A compiled executable function with the same signature as
`forward`. This function runs the optimized graph and
returns results with the same structure as `forward`
(single `Tensor` or tuple of tensors).
**Raises:**
* [TypeError](https://docs.python.org/3/library/exceptions.html#TypeError) – If input types don’t match `forward` signature or if
operations in `forward` cannot be traced.
* [RuntimeError](https://docs.python.org/3/library/exceptions.html#RuntimeError) – If graph construction fails due to incompatible
operations or parameter access issues.
### `descendants` {#max.nn.module.Module.descendants}
> property descendants: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Module](#max.nn.module.Module)\[..., [Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]
Iterates over the `Module`’s descendant modules.
**Yields:**
`(name, module)` pairs, where `name` is the qualified path
of the descendant with respect to the module.
### `forward()` {#max.nn.module.Module.forward}
> forward(\*args, \*\*kwargs)
Defines the computation performed by the module.
Users must override this method in their subclass to define the
module’s computation.
**Parameters:**
* \*args (\~\_P) – Positional arguments for the computation.
* \*\*kwargs (\~\_P) – Keyword arguments for the computation.
**Returns:**
The result of applying the module to the input.
**Raises:**
[NotImplementedError](https://docs.python.org/3/library/exceptions.html#NotImplementedError) – If the subclass does not override this method.
**Return type:**
\_R
### `load_state()` {#max.nn.module.Module.load_state}
> load\_state(lookup)
Replaces each parameter in the module and its descendants.
The transformation is applied in-place, updating the module’s values
and those of its descendants.
For example, if we have a model with two parameters, `weight` and
`bias`, we can load the state of the model from a dictionary with the
following code:
```python
from max.tensor import Tensor
from max.nn import Linear

model = Linear(2, 3)
weights = {
    "weight": Tensor.zeros([3, 2]),
    "bias": Tensor.zeros([3]),
}
model.load_state(lambda name, _: weights[name])
```
The lookup is defined as a function rather than a dictionary, allowing
for functional remapping of names during this process to account
for differences in common weight naming and storage conventions.
For instance, certain representations may not store weights as
transposed, or may need to be quantized, or split out from a shared
qkv block, or may just have slightly different names or paths.
This can also be used, for instance, to provide default values when
initializing LoRA weights.
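Such a remapping lookup can be sketched as follows (the checkpoint keys and `rename` table are hypothetical, chosen only to illustrate renaming plus a transpose fix-up):

```python
# A checkpoint that uses a different naming convention and stores the
# weight transposed relative to the model's layout.
checkpoint = {"layers.0.w": [[1, 2, 3], [4, 5, 6]]}
rename = {"fc1.weight": "layers.0.w"}

def transpose(m):
    return [list(row) for row in zip(*m)]

def lookup(name, current):
    # Map the model's qualified name to the checkpoint's key, then undo
    # the storage-side transpose before returning the new value.
    return transpose(checkpoint[rename[name]])

# model.load_state(lookup) would then receive the remapped weight.
```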
**Parameters:**
lookup ([Callable](../graph/ops.md#max.graph.ops.Callable)\[\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Tensor](../tensor.md#max.tensor.Tensor)], [DLPackArray](../driver.md#max.driver.DLPackArray)]) –
The lookup function for each parameter:
* The first argument is the qualified name of the parameter
with respect to the module on which `load_state()` was
called.
* The second argument is the existing tensor value.
* The return value of this function is the new value that will
replace the value at that name in the module tree.
### `load_state_dict()` {#max.nn.module.Module.load_state_dict}
> load\_state\_dict(state, strict=True)
Loads parameter values from a dictionary into the module hierarchy.
This method updates all module parameters in-place by loading values from
the provided state dictionary. The dictionary maps qualified parameter names
(dot-separated paths like `"fc1.weight"`) to tensor values.
The `strict` mode (default) ensures all weights in the dictionary are
actually used, catching errors from mismatched architectures or incorrect
weight names.
For example, the following loads weights from a dictionary into a model:
```python
from max.tensor import Tensor
from max.nn import Module, module_dataclass

@module_dataclass
class Linear(Module):
    weight: Tensor
    bias: Tensor

    def forward(self, x: Tensor) -> Tensor:
        return x @ self.weight.T + self.bias

model = Linear(
    weight=Tensor.zeros([10, 5]),
    bias=Tensor.zeros([10]),
)

# Load weights from dictionary
weights = {
    "weight": Tensor.zeros([10, 5]),
    "bias": Tensor.zeros([10]),
}
model.load_state_dict(weights)
```
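The bookkeeping behind `strict` and the error cases above can be sketched with a plain dict standing in for the module tree (illustrative only):

```python
# Every parameter must be present in `state`; in strict mode, every key in
# `state` must also match a parameter.
def load_state_dict(params, state, strict=True):
    for name in params:
        if name not in state:
            raise KeyError(f"missing parameter: {name}")
        params[name] = state[name]
    unused = set(state) - set(params)
    if strict and unused:
        raise ValueError(f"unused weights: {sorted(unused)}")
```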
**Parameters:**
* state ([Mapping](https://docs.python.org/3/library/collections.abc.html#collections.abc.Mapping)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [DLPackArray](../driver.md#max.driver.DLPackArray)]) – Dictionary mapping qualified parameter names to tensor values.
Keys should match the names from [`Module.parameters`](#max.nn.module.Module.parameters) property.
Values should be DLPack-compatible arrays or `Tensor` objects.
Their shapes and dtypes must match the existing parameters with the
corresponding name, but they may be on a different device. In the
case that the new value has a different device, it will be copied to
the same device as the existing value, and the parameter will be set
to the new copy.
* strict ([bool](https://docs.python.org/3/library/functions.html#bool)) – If [`True`](https://docs.python.org/3/library/constants.html#True) (default), verify that all keys in `state`
are used (i.e., match actual parameters). If [`False`](https://docs.python.org/3/library/constants.html#False), silently
ignore extra keys that don’t match any parameters.
**Raises:**
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If `strict=True` and some weights in `state` don’t
match any model parameters (indicates architecture mismatch or
incorrect weight names).
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If a loaded tensor has a different dtype or shape than
the existing parameter.
* [KeyError](https://docs.python.org/3/library/exceptions.html#KeyError) – If a required parameter name in the model is missing from
`state` (regardless of `strict` setting).
**Return type:**
None
### `local_parameters` {#max.nn.module.Module.local_parameters}
> property local\_parameters: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Tensor](../tensor.md#max.tensor.Tensor)]]
Iterates over the local parameters of the `Module`.
**Yields:**
`(name, tensor)` pairs, where `name` is the attribute name of
the tensor on the module.
### `map_parameters()` {#max.nn.module.Module.map_parameters}
> map\_parameters(f)
Creates a new `Module` with its parameters transformed by the function.
The transformation is functional rather than in-place. The module is
deep-copied; its descendants are also replaced via the same transform
without affecting the original module.
For example:
```python
from max.driver import Accelerator
from max.nn import Linear
model = Linear(2, 3)
model_on_gpu = model.map_parameters(lambda _, t: t.to(Accelerator()))
```
**Parameters:**
f ([Callable](../graph/ops.md#max.graph.ops.Callable)\[\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Tensor](../tensor.md#max.tensor.Tensor)], [Tensor](../tensor.md#max.tensor.Tensor)]) –
The transformation to apply to each parameter.
The transformation takes two arguments, a name and a tensor:
* The name is the qualified name of the parameter
with respect to the module on which `map_parameters()`
was called.
* The tensor is the current value of that parameter.
The return value of this function is the new value that will
replace the value at that name in the module tree.
**Returns:**
A new module tree of the same type resulting from mapping the
transformation over all model parameters.
### `parameters` {#max.nn.module.Module.parameters}
> property parameters: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Tensor](../tensor.md#max.tensor.Tensor)]]
Iterates over all parameters in this module and its sub-modules.
This property performs a depth-first traversal of the module hierarchy,
yielding each parameter tensor with its qualified name. The qualified name
uses dot-notation to represent the module tree structure (e.g.,
`"encoder.layer1.weight"`).
Parameters are yielded in depth-first order: first the current module’s
direct parameters, then recursively each sub-module’s parameters.
Counting total parameters:
```python
from max.tensor import Tensor
from max.nn import Module, module_dataclass, Linear

@module_dataclass
class MLP(Module):
    fc1: Linear
    fc2: Linear

    def forward(self, x: Tensor) -> Tensor:
        return self.fc2(self.fc1(x))

model = MLP(
    fc1=Linear(10, 20),
    fc2=Linear(20, 5),
)

# Count parameters
total_params = sum(
    param.num_elements()
    for name, param in model.parameters
)
print(f"Total parameters: {total_params}")
```
**Yields:**
`(name, parameter)` tuples where `name` is the
dot-separated qualified path of the parameter and `parameter`
is the `Tensor`.
### `to()` {#max.nn.module.Module.to}
> to(device)
Updates the module’s parameters, transferring them to the specified device.
```python
from max.driver import CPU
from max.nn import Linear
model = Linear(2, 3)
model.to(CPU())
```
**Parameters:**
device ([Device](../driver.md#max.driver.Device)) – The device to which all model parameters will be transferred.
**Returns:**
A reference to the model. The transfer is applied mutably; internal
parameters are updated to be transferred to the specified device.
## `module_dataclass()` {#max.nn.module.module_dataclass}
> max.nn.module.module\_dataclass(cls=None, /, \*, repr=False, \*\*kwargs)
Converts a class into a MAX module with automatic parameter tracking.
This decorator enables a regular Python class to function as a [`Module`](#max.nn.module.Module),
providing automatic discovery and registration of parameters (Tensor fields)
and nested modules. The decorated class gains all capabilities of [`Module`](#max.nn.module.Module),
including parameter iteration, graph compilation via [`Module.compile()`](#max.nn.module.Module.compile),
and hierarchical module composition.
The decorator applies Python’s `@dataclass` decorator internally while
preserving [`Module`](#max.nn.module.Module)’s specialized `__repr__` method for better
debugging experience when printing module structures.
**Parameters:**
* cls ([type](https://docs.python.org/3/library/functions.html#type)\[[Module](#max.nn.module.Module)\[..., [Any](https://docs.python.org/3/library/typing.html#typing.Any)]] | None) – The class to decorate. Must define a `forward` method.
When [`None`](https://docs.python.org/3/library/constants.html#None), returns a decorator function (supports
using `@module_dataclass` with or without parentheses).
* repr ([bool](https://docs.python.org/3/library/functions.html#bool)) – If [`True`](https://docs.python.org/3/library/constants.html#True), use dataclass’s default `__repr__` instead of
[`Module`](#max.nn.module.Module)’s rich representation. Defaults to [`False`](https://docs.python.org/3/library/constants.html#False).
* \*\*kwargs – Additional keyword arguments forwarded to Python’s
`@dataclass` decorator (e.g., `frozen`, `eq`).
**Returns:**
The decorated class as a [`Module`](#max.nn.module.Module) subclass with automatic parameter
tracking and graph compilation capabilities. When `cls` is [`None`](https://docs.python.org/3/library/constants.html#None),
returns a decorator function.
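The "with or without parentheses" behavior is the standard Python decorator idiom, sketched here generically (this shows the dispatch on `cls`, not MAX internals):

```python
import dataclasses

def module_dataclass(cls=None, /, **kwargs):
    def wrap(c):
        # Apply @dataclass with any forwarded options (e.g., frozen, eq).
        return dataclasses.dataclass(**kwargs)(c)
    if cls is None:
        return wrap        # used as @module_dataclass(...) with arguments
    return wrap(cls)       # used as bare @module_dataclass
```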
---
## GemmaRMSNorm
## `GemmaRMSNorm` {#max.nn.norm.rms_norm.GemmaRMSNorm}
> class max.nn.norm.rms\_norm.GemmaRMSNorm(dim, eps=1e-06)
Computes the Root Mean Square normalization on inputs.
Differences from traditional RMSNorm:
* Computes x \* (1 + w) instead of x \* w.
* Casts as (x \* w).to(orig\_dtype) instead of x.to(orig\_dtype) \* w.
**Parameters:**
* dim ([int](https://docs.python.org/3/library/functions.html#int))
* eps ([float](https://docs.python.org/3/library/functions.html#float))
### `forward()` {#max.nn.norm.rms_norm.GemmaRMSNorm.forward}
> forward(x)
Applies Gemma RMS normalization to the input tensor.
**Parameters:**
* x ([Tensor](../../tensor.md#max.tensor.Tensor)) – The input tensor to normalize.
**Returns:**
The normalized tensor.
**Return type:**
[Tensor](../../tensor.md#max.tensor.Tensor)
---
## norm (Norm)
Normalization layers for neural networks.
## Modules
* [`rms_norm`](/max/api/python/nn/norm/rms_norm): Root Mean Square normalization layer.
## Classes
* [`GemmaRMSNorm`](/max/api/python/nn/norm/GemmaRMSNorm): RMS normalization optimized for Gemma models.
:::note Note
For legacy normalization layers (LayerNorm, GroupNorm), see [legacy/norm](/max/api/python/nn/legacy/norm).
:::
---
## rms_norm
Root mean square layer normalization.
## `RMSNorm` {#max.nn.norm.rms_norm.RMSNorm}
> class max.nn.norm.rms\_norm.RMSNorm(dim, eps=1e-06)
Computes the Root Mean Square normalization on inputs.
**Parameters:**
* dim ([int](https://docs.python.org/3/library/functions.html#int))
* eps ([float](https://docs.python.org/3/library/functions.html#float))
### `dim` {#max.nn.norm.rms_norm.RMSNorm.dim}
> property dim: [Dim](../../graph/dim.md#max.graph.dim.Dim)
### `eps` {#max.nn.norm.rms_norm.RMSNorm.eps}
> eps: [float](https://docs.python.org/3/library/functions.html#float)
### `forward()` {#max.nn.norm.rms_norm.RMSNorm.forward}
> forward(x)
Applies RMS normalization to the input tensor.
**Parameters:**
* x ([Tensor](../../tensor.md#max.tensor.Tensor)) – The input tensor to normalize.
**Returns:**
The normalized tensor.
**Return type:**
[Tensor](../../tensor.md#max.tensor.Tensor)
### `weight` {#max.nn.norm.rms_norm.RMSNorm.weight}
> weight: [Tensor](../../tensor.md#max.tensor.Tensor)
## `rms_norm()` {#max.nn.norm.rms_norm.rms_norm}
> max.nn.norm.rms\_norm.rms\_norm(x, weight, eps, weight\_offset=0.0, multiply\_before\_cast=False)
Applies Root Mean Square layer normalization to an input tensor.
**Parameters:**
* x ([Tensor](../../tensor.md#max.tensor.Tensor)) – The input tensor
* weight ([Tensor](../../tensor.md#max.tensor.Tensor)) – The weights for the normalization
* eps ([float](https://docs.python.org/3/library/functions.html#float)) – A value added to the denominator of the normalization for
numerical stability
* weight\_offset ([float](https://docs.python.org/3/library/functions.html#float)) – A value added to the weights before normalization.
Typically 1 for Gemma-like normalization and 0 otherwise.
* multiply\_before\_cast ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to multiply before or after
casting to the output dtype. Typically True for Gemma-like
normalization and False otherwise.
**Returns:**
A layer-normalized tensor with the same shape and type as x.
**Return type:**
[Tensor](../../tensor.md#max.tensor.Tensor)
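The roles of `eps` and `weight_offset` can be sketched in pure Python (a simplified illustration of the formula only; the dtype-casting behavior controlled by `multiply_before_cast` is omitted, so that parameter does not appear):

```python
import math

# Normalize by the root mean square (eps stabilizes the denominator),
# then scale by (weight + weight_offset); weight_offset=1 gives the
# Gemma-style x * (1 + w) variant.
def rms_norm(x, weight, eps, weight_offset=0.0):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (w + weight_offset) for v, w in zip(x, weight)]
```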
---
## RotaryEmbedding
## `RotaryEmbedding` {#max.nn.rope.RotaryEmbedding}
> class max.nn.rope.RotaryEmbedding(weight: [max.tensor.Tensor](../../tensor.md#max.tensor.Tensor))
### `dim` {#max.nn.rope.RotaryEmbedding.dim}
> property dim: [int](https://docs.python.org/3/library/functions.html#int)
### `forward()` {#max.nn.rope.RotaryEmbedding.forward}
> forward(x, start\_pos=0)
Applies rotary positional embeddings (RoPE) to x.
seq\_len is inferred from the shape of x.
**Parameters:**
* x ([Tensor](../../tensor.md#max.tensor.Tensor)) – Activation tensor with shape (batch, seq\_len, n\_kv\_heads, head\_dim).
x is interpreted as a complex number valued tensor where the
head\_dim dimension is alternating pairs of (real, imaginary)
parts.
* start\_pos ([int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](../../graph/dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – Starting position of the input tensor. Defaults to 0.
**Returns:**
Input activation tensor with rotary positional embeddings applied and
the same shape as x.
**Return type:**
[Tensor](../../tensor.md#max.tensor.Tensor)
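For intuition, the interleaved-pair rotation can be sketched in NumPy. This is an illustrative reference implementation under common RoPE conventions (base `theta=10000`), not the MAX kernel:

```python
import numpy as np

def rope_interleaved(x, start_pos=0, theta=10000.0):
    """Apply rotary embeddings to x of shape (batch, seq_len, n_heads, head_dim),
    where head_dim holds alternating (real, imaginary) pairs."""
    batch, seq_len, n_heads, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per (real, imaginary) pair.
    freqs = theta ** (-np.arange(half) / half)                           # (half,)
    # Each absolute position p is rotated by angle p * freqs.
    angles = np.arange(start_pos, start_pos + seq_len)[:, None] * freqs  # (seq_len, half)
    cos = np.cos(angles)[None, :, None, :]
    sin = np.sin(angles)[None, :, None, :]
    # View adjacent pairs as complex numbers and rotate by e^{i * angle}.
    xc = x.reshape(batch, seq_len, n_heads, half, 2)
    xr, xi = xc[..., 0], xc[..., 1]
    out = np.stack([xr * cos - xi * sin, xr * sin + xi * cos], axis=-1)
    return out.reshape(x.shape)
```

Position 0 is left unchanged (zero rotation angle), and because each pair is rotated, the overall norm of the activations is preserved.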
## `TransposedRotaryEmbedding` {#max.nn.rope.rope.TransposedRotaryEmbedding}
### `forward()` {#max.nn.rope.rope.TransposedRotaryEmbedding.forward}
> forward(x, start\_pos=0)
Applies rotary positional embeddings (RoPE) to x.
The representation of x is transposed within the final dimension
compared to traditional RotaryEmbedding.
seq\_len is inferred from the shape of x.
**Parameters:**
* x ([Tensor](../../tensor.md#max.tensor.Tensor)) – Activation tensor with shape (batch, seq\_len, n\_kv\_heads, head\_dim).
x is interpreted as a complex number valued tensor where the
first half of head\_dim are the real parts and the last half
are the imaginary parts.
* start\_pos ([int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](../../graph/dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – Starting position of the input tensor. Defaults to 0.
**Returns:**
Input activation tensor with rotary positional embeddings applied and
the same shape as x.
**Return type:**
[Tensor](../../tensor.md#max.tensor.Tensor)
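The transposed layout pairs element `i` with element `i + head_dim // 2` instead of its adjacent neighbor. A NumPy sketch of this variant, under the same illustrative conventions as above (not the MAX kernel):

```python
import numpy as np

def rope_split_half(x, start_pos=0, theta=10000.0):
    """RoPE where the first half of head_dim holds real parts and the
    second half holds imaginary parts (the 'transposed' layout)."""
    batch, seq_len, n_heads, head_dim = x.shape
    half = head_dim // 2
    freqs = theta ** (-np.arange(half) / half)
    angles = np.arange(start_pos, start_pos + seq_len)[:, None] * freqs
    cos = np.cos(angles)[None, :, None, :]
    sin = np.sin(angles)[None, :, None, :]
    # Element i pairs with element i + half rather than with element i + 1.
    xr, xi = x[..., :half], x[..., half:]
    return np.concatenate([xr * cos - xi * sin, xr * sin + xi * cos], axis=-1)
```

Because the rotation depends only on the absolute position index, shifting `start_pos` is equivalent to dropping the corresponding prefix of the sequence.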
---
## rope
Rotary positional embedding (RoPE) implementations.
## Classes
* [`RotaryEmbedding`](/max/api/python/nn/rope/RotaryEmbedding): Standard rotary position embedding implementation.
* [`TransposedRotaryEmbedding`](/max/api/python/nn/rope/TransposedRotaryEmbedding): RoPE with transposed tensor layout.
:::note Note
For legacy RoPE variants (DynamicRotaryEmbedding, YarnRotaryEmbedding, etc.),
see [legacy/rotary\_embedding](/max/api/python/nn/legacy/rotary_embedding).
:::
---
## sequential (Nn)
:::note Note
This module contains both `Sequential` and `ModuleList` containers.
For the legacy graph-based sequential container, see [legacy/sequential](/max/api/python/nn/legacy/sequential).
:::
A Module for a sequence of tensor transformations.
## `ModuleList` {#max.nn.sequential.ModuleList}
> class max.nn.sequential.ModuleList(iterable=(), /)
A `Module` subclass that is also a list container.
`ModuleList` instances will use the stringified integer index of their
submodules as the name of the module for the purposes of
qualified paths.
For example:
```python
from max.nn import Linear, Sequential
model = Sequential(
    Linear(5, 10),
    Linear(10, 5),
)
assert dict(model.parameters).keys() == {
    "0.weight", "0.bias", "1.weight", "1.bias"
}
```
### `children` {#max.nn.sequential.ModuleList.children}
> property children: [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Module](module.md#max.nn.module.Module)\[..., [Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]
Iterates over the direct child modules of the `Module`.
**Yields:**
`(name, module)` pairs, where `name` is the attribute name of
the child on the module.
## `Sequential` {#max.nn.sequential.Sequential}
> class max.nn.sequential.Sequential(\*modules)
A `Module` subclass which holds a sequence of unary modules.
A unary `Module` is one whose `forward()` method has the signature:
```default
def forward(self, x: Tensor) -> Tensor: ...
```
`Sequential` is itself a unary `Module`. Its `forward()` method
computes the result of applying each of its child modules
in sequence to its input.
For example, this applies a linear transformation up to a dimension
of 10, then a final linear transformation to reduce back to the input
dimension of 5:
```python
from max.tensor import Tensor
from max.nn import Linear, Sequential
model = Sequential(
    Linear(5, 10),
    Linear(10, 5),
)
result = model(Tensor.ones([5]))
assert result.shape == [5]
```
**Parameters:**
modules (Module) – The unary modules to apply, in sequence.
### `forward()` {#max.nn.sequential.Sequential.forward}
> forward(x)
Applies the contained modules in order.
For example, this code creates a sequence of linear transformations
which each increase the dimension of the input by 5.
The input tensor must have dim 5. The intermediate applications
will result in intermediate tensors of dim 10 and 15 respectively,
and the final result will have dim 20:
```python
from max.tensor import Tensor
from max.nn import Linear, Sequential
hidden_dims = [5, 10, 15, 20]
model = Sequential(*(
    Linear(in_dim, out_dim) for in_dim, out_dim in
    zip(hidden_dims, hidden_dims[1:])
))
result = model(Tensor.ones([5]))
assert result.shape == [20]
```
**Parameters:**
x ([Tensor](../tensor.md#max.tensor.Tensor)) – The input tensor.
**Returns:**
The result of iteratively applying each contained
module in sequence.
**Return type:**
[Tensor](../tensor.md#max.tensor.Tensor)
---
## architectures
## `register_all_models()` {#max.pipelines.architectures.register_all_models}
> max.pipelines.architectures.register\_all\_models()
Register all built-in model architectures with the global registry.
This function imports each supported model architecture module (Llama, Mistral,
Qwen, Gemma, DeepSeek, etc.) and registers their `SupportedArchitecture`
definitions with `PIPELINE_REGISTRY`.
This function is called automatically when `max.pipelines` is imported,
so you typically don’t need to call it manually. It uses an internal flag to
ensure architectures are only registered once, making repeated calls safe but
unnecessary.
### `model_config` {#max.pipelines.lib.config.AudioGenerationConfig.model_config}
> model\_config: ClassVar\[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to [ConfigDict](https://docs.pydantic.dev/latest/api/config/#pydantic.config.ConfigDict).
### `model_post_init()` {#max.pipelines.lib.config.AudioGenerationConfig.model_post_init}
> model\_post\_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
**Parameters:**
* self (BaseModel) – The BaseModel instance.
* context (Any) – The context.
**Return type:**
None
### `prepend_prompt_speech_tokens` {#max.pipelines.lib.config.AudioGenerationConfig.prepend_prompt_speech_tokens}
> prepend\_prompt\_speech\_tokens: [PrependPromptSpeechTokens](#max.pipelines.lib.config.PrependPromptSpeechTokens)
### `prepend_prompt_speech_tokens_causal` {#max.pipelines.lib.config.AudioGenerationConfig.prepend_prompt_speech_tokens_causal}
> prepend\_prompt\_speech\_tokens\_causal: [bool](https://docs.python.org/3/library/functions.html#bool)
### `prometheus_metrics_mode` {#max.pipelines.lib.config.AudioGenerationConfig.prometheus_metrics_mode}
> prometheus\_metrics\_mode: [PrometheusMetricsMode](#max.pipelines.lib.config.PrometheusMetricsMode)
## `PipelineConfig` {#max.pipelines.lib.config.PipelineConfig}
> class max.pipelines.lib.config.PipelineConfig(\*, config\_file=None, section\_name=None, max\_length=None, pipeline\_role=PipelineRole.PrefillAndDecode, max\_batch\_size=None, max\_queue\_size\_tg=None, min\_batch\_size\_tg=None, ep\_size=1, ce\_delay\_ms=0.0, enable\_prioritize\_first\_decode=False, enable\_chunked\_prefill=True, enable\_in\_flight\_batching=False, max\_num\_steps=-1, max\_batch\_input\_tokens=8192, enable\_echo=False, pool\_embeddings=True, chat\_template=None, use\_experimental\_kernels='false', use\_vendor\_blas='false', pdl\_level='0', custom\_architectures=\, zmq\_endpoint\_base=\, execute\_empty\_batches=False, max\_batch\_total\_tokens=None, device\_graph\_capture=False, force=False, kvcache\_ce\_watermark=0.95, enable\_overlap\_scheduler=False, use\_legacy\_module=True, defer\_resolve=False, model=\, draft\_model=None, sampling=\, profiling=\, lora=None, speculative=None)
Configuration for a pipeline.
WIP: Once a PipelineConfig is fully initialized, it should be as immutable
as possible (frozen=True). All underlying dataclass fields should have been
initialized to their default values, whether user-specified (via a CLI
flag, config file, or environment variable) or internally set to a
reasonable default.
### `log_pipeline_info()` {#max.pipelines.lib.config.PipelineConfig.log_pipeline_info}
> log\_pipeline\_info()
Logs comprehensive pipeline and KVCache configuration information.
Retrieves all necessary information from self and the PIPELINE\_REGISTRY.
Raises an error if the architecture is not found (which should not happen after config resolution).
**Return type:**
None
### `lora` {#max.pipelines.lib.config.PipelineConfig.lora}
> lora: [LoRAConfig](lora_config.md#max.pipelines.lib.lora_config.LoRAConfig) | [None](https://docs.python.org/3/library/constants.html#None)
### `max_batch_input_tokens` {#max.pipelines.lib.config.PipelineConfig.max_batch_input_tokens}
> max\_batch\_input\_tokens: [int](https://docs.python.org/3/library/functions.html#int)
### `max_batch_size` {#max.pipelines.lib.config.PipelineConfig.max_batch_size}
> max\_batch\_size: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
### `max_batch_total_tokens` {#max.pipelines.lib.config.PipelineConfig.max_batch_total_tokens}
> max\_batch\_total\_tokens: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
### `max_length` {#max.pipelines.lib.config.PipelineConfig.max_length}
> max\_length: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
### `max_num_steps` {#max.pipelines.lib.config.PipelineConfig.max_num_steps}
> max\_num\_steps: [int](https://docs.python.org/3/library/functions.html#int)
### `max_queue_size_tg` {#max.pipelines.lib.config.PipelineConfig.max_queue_size_tg}
> max\_queue\_size\_tg: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
### `min_batch_size_tg` {#max.pipelines.lib.config.PipelineConfig.min_batch_size_tg}
> min\_batch\_size\_tg: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None)
### `model` {#max.pipelines.lib.config.PipelineConfig.model}
> model: [MAXModelConfig](model_config.md#max.pipelines.lib.model_config.MAXModelConfig)
### `model_config` {#max.pipelines.lib.config.PipelineConfig.model_config}
> model\_config: ClassVar\[ConfigDict] = {}
Configuration for the model; should be a dictionary conforming to [ConfigDict](https://docs.pydantic.dev/latest/api/config/#pydantic.config.ConfigDict).
### `model_post_init()` {#max.pipelines.lib.config.PipelineConfig.model_post_init}
> model\_post\_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
**Parameters:**
* self (BaseModel) – The BaseModel instance.
* context (Any) – The context.
**Return type:**
None
### `pdl_level` {#max.pipelines.lib.config.PipelineConfig.pdl_level}
> pdl\_level: [str](https://docs.python.org/3/library/stdtypes.html#str)
### `pipeline_role` {#max.pipelines.lib.config.PipelineConfig.pipeline_role}
> pipeline\_role: PipelineRole
### `pool_embeddings` {#max.pipelines.lib.config.PipelineConfig.pool_embeddings}
> pool\_embeddings: [bool](https://docs.python.org/3/library/functions.html#bool)
### `profiling` {#max.pipelines.lib.config.PipelineConfig.profiling}
> profiling: ProfilingConfig
### `resolve()` {#max.pipelines.lib.config.PipelineConfig.resolve}
> resolve()
Validates and resolves the config.
Called after the config is initialized to ensure all config fields
are in a valid state.
**Return type:**
None
### `retrieve_chat_template()` {#max.pipelines.lib.config.PipelineConfig.retrieve_chat_template}
> retrieve\_chat\_template()
Returns the chat template string, or None if not set.
### `sampling` {#max.pipelines.lib.config.PipelineConfig.sampling}
> sampling: SamplingConfig
### `speculative` {#max.pipelines.lib.config.PipelineConfig.speculative}
> speculative: SpeculativeConfig | [None](https://docs.python.org/3/library/constants.html#None)
### `use_experimental_kernels` {#max.pipelines.lib.config.PipelineConfig.use_experimental_kernels}
> use\_experimental\_kernels: [str](https://docs.python.org/3/library/stdtypes.html#str)
### `use_legacy_module` {#max.pipelines.lib.config.PipelineConfig.use_legacy_module}
> use\_legacy\_module: [bool](https://docs.python.org/3/library/functions.html#bool)
### `use_vendor_blas` {#max.pipelines.lib.config.PipelineConfig.use_vendor_blas}
> use\_vendor\_blas: [str](https://docs.python.org/3/library/stdtypes.html#str)
### `zmq_endpoint_base` {#max.pipelines.lib.config.PipelineConfig.zmq_endpoint_base}
> zmq\_endpoint\_base: [str](https://docs.python.org/3/library/stdtypes.html#str)
## `PrependPromptSpeechTokens` {#max.pipelines.lib.config.PrependPromptSpeechTokens}
> class max.pipelines.lib.config.PrependPromptSpeechTokens(value, names=\, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
### `NEVER` {#max.pipelines.lib.config.PrependPromptSpeechTokens.NEVER}
> NEVER = 'never'
Never prepend the prompt speech tokens sent to the audio decoder.
### `ONCE` {#max.pipelines.lib.config.PrependPromptSpeechTokens.ONCE}
> ONCE = 'once'
Prepend the prompt speech tokens to the first block of the audio decoder.
### `ROLLING` {#max.pipelines.lib.config.PrependPromptSpeechTokens.ROLLING}
> ROLLING = 'rolling'
Prepend the prompt speech tokens to the first block of the audio decoder,
and to later blocks to reach the requested buffer size.
## `PrometheusMetricsMode` {#max.pipelines.lib.config.PrometheusMetricsMode}
> class max.pipelines.lib.config.PrometheusMetricsMode(value, names=\, \*values, module=None, qualname=None, type=None, start=1, boundary=None)
### `INSTRUMENT_ONLY` {#max.pipelines.lib.config.PrometheusMetricsMode.INSTRUMENT_ONLY}
> INSTRUMENT\_ONLY = 'instrument\_only'
Instrument metrics through the Prometheus client library, relying on the application to handle the metrics server.
### `LAUNCH_MULTIPROC_SERVER` {#max.pipelines.lib.config.PrometheusMetricsMode.LAUNCH_MULTIPROC_SERVER}
> LAUNCH\_MULTIPROC\_SERVER = 'launch\_multiproc\_server'
Launch a Prometheus server in multiprocess mode to report metrics.
### `LAUNCH_SERVER` {#max.pipelines.lib.config.PrometheusMetricsMode.LAUNCH_SERVER}
> LAUNCH\_SERVER = 'launch\_server'
Launch a Prometheus server to handle metrics requests.
---
## core
## `PixelContext` {#max.pipelines.core.PixelContext}
> class max.pipelines.core.PixelContext(\*, tokens, request\_id=\, model\_name='', mask=None, tokens\_2=None, negative\_tokens=None, negative\_tokens\_2=None, extra\_params=\, timesteps=\, sigmas=\, latents=\, latent\_image\_ids=\, height=1024, width=1024, num\_inference\_steps=50, guidance\_scale=3.5, guidance=None, true\_cfg\_scale=1.0, num\_warmup\_steps=0, num\_images\_per\_prompt=1, status=GenerationStatus.ACTIVE)
A model-ready context for image/video generation requests.
Per the design doc, this class contains only numeric data that the model
will execute against. User-facing strings (prompt, negative\_prompt) are
consumed during tokenization and do not appear here.
All preprocessing is performed by PixelGenerationTokenizer.new\_context():
* Prompt tokenization -> tokens field
* Negative prompt tokenization -> negative\_tokens field
* Timestep schedule computation -> timesteps field
* Initial noise generation -> latents field
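An illustrative sketch of the schedule and noise precomputation. This uses a toy linear schedule and an assumed latent layout (16 channels, 8-pixel patches); the real tokenizer's scheduler, sigma mapping, and latent shapes are model-specific:

```python
import numpy as np

def make_schedule(num_inference_steps, num_train_timesteps=1000):
    """Illustrative linearly spaced denoising schedule (descending timesteps)."""
    timesteps = np.linspace(num_train_timesteps - 1, 0,
                            num_inference_steps, dtype=np.float32)
    sigmas = timesteps / num_train_timesteps  # toy sigma mapping
    return timesteps, sigmas

def make_latents(height, width, channels=16, patch=8, seed=0):
    """Illustrative initial Gaussian noise for a grid of latent patches."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(
        (1, height // patch, width // patch, channels)).astype(np.float32)

timesteps, sigmas = make_schedule(num_inference_steps=50)
latents = make_latents(height=1024, width=1024)
```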
**Parameters:**
* tokens ([TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer)) – Tokenized prompt IDs (TokenBuffer).
* request\_id ([RequestID](../interfaces.md#max.interfaces.RequestID)) – A unique identifier for this generation request.
* model\_name ([str](https://docs.python.org/3/library/stdtypes.html#str)) – Name of the model being used.
* mask ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[bool](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.bool)]] | None)
* tokens\_2 ([TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer) | None)
* negative\_tokens ([TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer) | None) – Tokenized negative prompt IDs (TokenBuffer).
* negative\_tokens\_2 ([TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer) | None)
* extra\_params ([dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]])
* timesteps ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]]) – Precomputed timestep schedule for denoising.
* sigmas ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]])
* latents ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]]) – Precomputed initial noise (latents).
* latent\_image\_ids ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]])
* height ([int](https://docs.python.org/3/library/functions.html#int)) – Height of the generated image/video in pixels.
* width ([int](https://docs.python.org/3/library/functions.html#int)) – Width of the generated image/video in pixels.
* num\_inference\_steps ([int](https://docs.python.org/3/library/functions.html#int)) – Number of denoising steps.
* guidance\_scale ([float](https://docs.python.org/3/library/functions.html#float)) – Guidance scale for classifier-free guidance.
* guidance ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]] | None)
* true\_cfg\_scale ([float](https://docs.python.org/3/library/functions.html#float))
* num\_warmup\_steps ([int](https://docs.python.org/3/library/functions.html#int))
* num\_images\_per\_prompt ([int](https://docs.python.org/3/library/functions.html#int)) – Number of images/videos to generate per prompt.
* status ([GenerationStatus](../interfaces.md#max.interfaces.GenerationStatus))
### `compute_num_available_steps()` {#max.pipelines.core.PixelContext.compute_num_available_steps}
> compute\_num\_available\_steps(max\_seq\_len)
Compute number of available steps for scheduler compatibility.
For image and video generation, this returns the number of inference steps.
### `width` {#max.pipelines.core.PixelContext.width}
> width: [int](https://docs.python.org/3/library/functions.html#int) = 1024
## `TTSContext` {#max.pipelines.core.TTSContext}
> class max.pipelines.core.TTSContext(\*, max\_length, tokens, request\_id=\, eos\_token\_ids=\, eos\_sequences=\, log\_probabilities=0, log\_probabilities\_echo=False, ignore\_eos=False, json\_schema=None, sampling\_params=\, model\_name='', \_matcher=None, status=GenerationStatus.ACTIVE, \_log\_probabilities\_data=\, \_is\_initial\_prompt=True, \_draft\_offset=0, target\_endpoint=None, audio\_prompt\_tokens=\, buffer\_speech\_tokens=None, audio\_buffer=None, prev\_samples\_beyond\_offset=0, streaming=False, \_speech\_token\_size=128, \_speech\_token\_end\_idx=0, \_speech\_tokens=\, decoded\_index=0, \_block\_counter=0, \_arrival\_time=\, audio\_generation\_status=GenerationStatus.ACTIVE)
A context for Text-to-Speech (TTS) model inference.
This class extends TextContext to handle speech token generation and management.
It maintains buffers for audio prompt tokens and generated speech tokens, along
with tracking indices for decoding progress.
**Parameters:**
* max\_length ([int](https://docs.python.org/3/library/functions.html#int))
* tokens ([TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer))
* request\_id ([RequestID](../interfaces.md#max.interfaces.RequestID))
* eos\_token\_ids ([set](https://docs.python.org/3/library/stdtypes.html#set)\[[int](https://docs.python.org/3/library/functions.html#int)])
* eos\_sequences ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]])
* log\_probabilities ([int](https://docs.python.org/3/library/functions.html#int))
* log\_probabilities\_echo ([bool](https://docs.python.org/3/library/functions.html#bool))
* ignore\_eos ([bool](https://docs.python.org/3/library/functions.html#bool))
* json\_schema ([str](https://docs.python.org/3/library/stdtypes.html#str) | None)
* sampling\_params ([SamplingParams](../interfaces.md#max.interfaces.SamplingParams))
* model\_name ([str](https://docs.python.org/3/library/stdtypes.html#str))
* \_matcher ([Any](https://docs.python.org/3/library/typing.html#typing.Any) | None)
* status ([GenerationStatus](../interfaces.md#max.interfaces.GenerationStatus))
* \_log\_probabilities\_data ([dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[int](https://docs.python.org/3/library/functions.html#int), [LogProbabilities](../interfaces.md#max.interfaces.LogProbabilities)])
* \_is\_initial\_prompt ([bool](https://docs.python.org/3/library/functions.html#bool))
* \_draft\_offset ([int](https://docs.python.org/3/library/functions.html#int))
* target\_endpoint ([str](https://docs.python.org/3/library/stdtypes.html#str) | None)
* audio\_prompt\_tokens ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]) – Array of input audio prompt tokens used for voice cloning
* buffer\_speech\_tokens ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]] | None)
* audio\_buffer ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]] | None)
* prev\_samples\_beyond\_offset ([int](https://docs.python.org/3/library/functions.html#int))
* streaming ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether the request streams the audio to the client
* \_speech\_token\_size ([int](https://docs.python.org/3/library/functions.html#int)) – Size of the speech token buffer, defaults to SPEECH\_TOKEN\_audio\_chunk\_size
* \_speech\_token\_end\_idx ([int](https://docs.python.org/3/library/functions.html#int)) – Index marking the end of valid speech tokens
* \_speech\_tokens ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]) – Buffer containing the generated speech tokens
* decoded\_index ([int](https://docs.python.org/3/library/functions.html#int))
* \_block\_counter ([int](https://docs.python.org/3/library/functions.html#int)) – Counter tracking number of speech token blocks generated
* \_arrival\_time ([float](https://docs.python.org/3/library/functions.html#float))
* audio\_generation\_status ([GenerationStatus](../interfaces.md#max.interfaces.GenerationStatus))
### `audio_buffer` {#max.pipelines.core.TTSContext.audio_buffer}
> audio\_buffer: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]] | [None](https://docs.python.org/3/library/constants.html#None) = None
### `audio_generation_status` {#max.pipelines.core.TTSContext.audio_generation_status}
> audio\_generation\_status: [GenerationStatus](../interfaces.md#max.interfaces.GenerationStatus) = 'active'
### `audio_prompt_tokens` {#max.pipelines.core.TTSContext.audio_prompt_tokens}
> audio\_prompt\_tokens: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]
### `block_counter` {#max.pipelines.core.TTSContext.block_counter}
> property block\_counter: [int](https://docs.python.org/3/library/functions.html#int)
### `buffer_speech_tokens` {#max.pipelines.core.TTSContext.buffer_speech_tokens}
> buffer\_speech\_tokens: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]] | [None](https://docs.python.org/3/library/constants.html#None) = None
### `decoded_index` {#max.pipelines.core.TTSContext.decoded_index}
> decoded\_index: [int](https://docs.python.org/3/library/functions.html#int) = 0
### `is_done` {#max.pipelines.core.TTSContext.is_done}
> property is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
### `next_speech_tokens()` {#max.pipelines.core.TTSContext.next_speech_tokens}
> next\_speech\_tokens(audio\_chunk\_size=None, buffer=None)
Returns a chunk of the next unseen speech tokens.
Calling this function will not update the index of the last seen
token. This must be done by setting decoded\_index after the chunk
is processed.
**Parameters:**
* audio\_chunk\_size ([int](https://docs.python.org/3/library/functions.html#int) | None) – The number of speech tokens to return.
* buffer ([int](https://docs.python.org/3/library/functions.html#int) | None) – The number of previous speech tokens to pass to the audio
decoder on each generation step.
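The chunking semantics can be sketched as follows. This is a standalone sketch over a plain array; the real method reads these values from the context and, as noted above, leaves `decoded_index` for the caller to update:

```python
import numpy as np

def next_speech_tokens_sketch(speech_tokens, end_idx, decoded_index,
                              audio_chunk_size=None, buffer=None):
    """Return (chunk, next_decoded_index) of not-yet-decoded speech tokens,
    optionally prefixed with `buffer` previously decoded tokens."""
    # Stop after at most audio_chunk_size new tokens (or all remaining ones).
    stop = end_idx if audio_chunk_size is None else min(
        end_idx, decoded_index + audio_chunk_size)
    # Optionally re-include up to `buffer` previous tokens for the decoder.
    start = decoded_index if buffer is None else max(0, decoded_index - buffer)
    return speech_tokens[start:stop], stop

tokens = np.arange(10)
chunk, next_idx = next_speech_tokens_sketch(
    tokens, end_idx=10, decoded_index=4, audio_chunk_size=3, buffer=2)
```

Here the chunk spans indices 2 through 6: two previously decoded tokens of context plus three new ones, and the caller would then set `decoded_index = 7`.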
## `TextAndVisionContext` {#max.pipelines.core.TextAndVisionContext}
> class max.pipelines.core.TextAndVisionContext(\*, max\_length, tokens, request\_id=\, eos\_token\_ids=\, eos\_sequences=\, log\_probabilities=0, log\_probabilities\_echo=False, ignore\_eos=False, json\_schema=None, sampling\_params=\, model\_name='', \_matcher=None, status=GenerationStatus.ACTIVE, \_log\_probabilities\_data=\, \_is\_initial\_prompt=True, \_draft\_offset=0, target\_endpoint=None, vision\_token\_ids, images=\, extra\_model\_args=\)
A base class for model context, specifically for Vision model variants.
For example:
```default
- vision start token = 97
- image patch token  = 98
- vision end token   = 99
```
Token array:
```default
- idx: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ]
- token_ids: [ 51 52 53 54 97 98 98 98 98 99 55 56 57 58 97 98 98 98 98 99 59 60 61 62 ]
^-- img0 --^ ^-- img1 --^
^ start_idx=11 (image_idx=1)
```
Then we would have:
```default
- ImageMetadata(start_idx=5, end_idx=9, ...) # img0
- ImageMetadata(start_idx=15, end_idx=19, ...) # img1
```
These image ranges should be non-overlapping.
The image\_idx is determined based on the value of start\_idx. It is the idx of
the first image that is not yet encoded. For example in the above diagram
when start\_idx=11, this implies that image\_idx=1.
Currently, start\_idx and current\_position are not allowed to fall in the
middle of an image. This is verified by \_validate\_state methods that are
called before and after mutating methods such as \_bump\_token\_indices.
### `compute_image_aligned_idx()` {#max.pipelines.core.TextAndVisionContext.compute_image_aligned_idx}
> compute\_image\_aligned\_idx(idx)
Aligns an index value downward if it lies in the middle of an image.
### `extra_model_args` {#max.pipelines.core.TextAndVisionContext.extra_model_args}
> extra\_model\_args: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]
Extra model arguments for the vision model. These are model specific arguments.
### `image_idx` {#max.pipelines.core.TextAndVisionContext.image_idx}
> property image\_idx: [int](https://docs.python.org/3/library/functions.html#int)
Index of the next unencoded image in the prompt.
### `images` {#max.pipelines.core.TextAndVisionContext.images}
> images: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[ImageMetadata](../interfaces.md#max.interfaces.ImageMetadata)]
Metadata about each image in the prompt.
### `needs_vision_encoding` {#max.pipelines.core.TextAndVisionContext.needs_vision_encoding}
> property needs\_vision\_encoding: [bool](https://docs.python.org/3/library/functions.html#bool)
Returns whether vision encoding is needed for this context.
### `next_images` {#max.pipelines.core.TextAndVisionContext.next_images}
> property next\_images: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[ImageMetadata](../interfaces.md#max.interfaces.ImageMetadata)]
Returns the images that are not yet encoded.
### `update()` {#max.pipelines.core.TextAndVisionContext.update}
> update(new\_token, log\_probabilities=None)
Updates the next\_tokens and extends existing tokens to include all generated tokens.
### `vision_token_ids` {#max.pipelines.core.TextAndVisionContext.vision_token_ids}
> vision\_token\_ids: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
The ID of the image special token. This is a list primarily because Pixtral
also defines an image\_break\_token\_id.
## `TextContext` {#max.pipelines.core.TextContext}
> class max.pipelines.core.TextContext(\*, max\_length, tokens, request\_id=\<factory>, eos\_token\_ids=\<factory>, eos\_sequences=\<factory>, log\_probabilities=0, log\_probabilities\_echo=False, ignore\_eos=False, json\_schema=None, sampling\_params=\<factory>, model\_name='', \_matcher=None, status=GenerationStatus.ACTIVE, \_log\_probabilities\_data=\<factory>, \_is\_initial\_prompt=True, \_draft\_offset=0, target\_endpoint=None)
A base class for model context, specifically for Text model variants.
This class manages the state and processing of text generation, including token management,
caching, and generation parameters.
**Parameters:**
* max\_length ([int](https://docs.python.org/3/library/functions.html#int)) – Maximum allowed length of the generated sequence
* tokens ([TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer)) – NumPy array containing the token IDs
* request\_id ([RequestID](../interfaces.md#max.interfaces.RequestID)) – A unique identifier for this sequence.
* eos\_token\_ids ([set](https://docs.python.org/3/library/stdtypes.html#set)\[[int](https://docs.python.org/3/library/functions.html#int)]) – Set of token IDs that indicate end of sequence
* eos\_sequences ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]])
* log\_probabilities ([int](https://docs.python.org/3/library/functions.html#int)) – Whether to return token log probabilities
* log\_probabilities\_echo ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to return log probabilities for prompt tokens
* ignore\_eos ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to ignore end of sequence tokens and continue generating
* json\_schema ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Optional JSON schema for structured output
* sampling\_params ([SamplingParams](../interfaces.md#max.interfaces.SamplingParams)) – Parameters controlling the token sampling strategy
* model\_name ([str](https://docs.python.org/3/library/stdtypes.html#str))
* \_matcher ([Any](https://docs.python.org/3/library/typing.html#typing.Any) | None)
* status ([GenerationStatus](../interfaces.md#max.interfaces.GenerationStatus))
* \_log\_probabilities\_data ([dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[int](https://docs.python.org/3/library/functions.html#int), [LogProbabilities](../interfaces.md#max.interfaces.LogProbabilities)]) – Token log probabilities data
* \_is\_initial\_prompt ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether this is the initial prompt encoding
* \_draft\_offset ([int](https://docs.python.org/3/library/functions.html#int)) – Offset for draft decoding
* target\_endpoint ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Optional target endpoint identifier for routing requests
### `compute_num_available_steps()` {#max.pipelines.core.TextContext.compute_num_available_steps}
> compute\_num\_available\_steps(max\_seq\_len)
Computes the maximum number of steps that can be executed for a given context
without exceeding max\_seq\_len.
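Conceptually, this clamps the step count to the remaining room in the sequence. A simplified sketch (the names here are illustrative, not the real signature):

```python
def num_available_steps(current_length: int, max_seq_len: int) -> int:
    # Each step produces one token, so the number of steps is bounded by
    # the remaining room before max_seq_len is reached.
    return max(0, max_seq_len - current_length)
```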
### `eos_sequences` {#max.pipelines.core.TextContext.eos_sequences}
> eos\_sequences: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]]
### `eos_token_ids` {#max.pipelines.core.TextContext.eos_token_ids}
> eos\_token\_ids: [set](https://docs.python.org/3/library/stdtypes.html#set)\[[int](https://docs.python.org/3/library/functions.html#int)]
### `get_min_token_logit_mask()` {#max.pipelines.core.TextContext.get_min_token_logit_mask}
> get\_min\_token\_logit\_mask(num\_steps)
Returns a set of indices for the tokens in the output that should be masked.
This is primarily used for the min\_tokens setting, where we mask
eos tokens in the logits to avoid generating them before we reach
min\_tokens.
**Returns:**
A set of indices for the tokens in the output that should be masked.
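A simplified sketch of the masking logic described above; the function name and the (step, token\_id) index pairs are illustrative assumptions:

```python
def min_tokens_logit_mask(
    eos_token_ids: set[int],
    num_generated: int,
    min_tokens: int,
    num_steps: int,
) -> set[tuple[int, int]]:
    """Return (step, token_id) pairs whose logits should be masked."""
    masked = set()
    for step in range(num_steps):
        # Keep suppressing EOS until min_tokens tokens have been generated.
        if num_generated + step < min_tokens:
            for token_id in eos_token_ids:
                masked.add((step, token_id))
    return masked
```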
### `ignore_eos` {#max.pipelines.core.TextContext.ignore_eos}
> ignore\_eos: [bool](https://docs.python.org/3/library/functions.html#bool) = False
### `is_done` {#max.pipelines.core.TextContext.is_done}
> property is\_done: [bool](https://docs.python.org/3/library/functions.html#bool)
### `is_initial_prompt` {#max.pipelines.core.TextContext.is_initial_prompt}
> property is\_initial\_prompt: [bool](https://docs.python.org/3/library/functions.html#bool)
Returns True if the context has not yet been updated with generated tokens.
### `json_schema` {#max.pipelines.core.TextContext.json_schema}
> json\_schema: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `jump_ahead()` {#max.pipelines.core.TextContext.jump_ahead}
> jump\_ahead(new\_token)
Updates the token array, while ensuring the new token is returned to the user.
### `request_id` {#max.pipelines.core.TextContext.request_id}
> request\_id: [RequestID](../interfaces.md#max.interfaces.RequestID)
### `reset()` {#max.pipelines.core.TextContext.reset}
> reset()
Resets the context’s state by combining all tokens into a new prompt.
### `status` {#max.pipelines.core.TextContext.status}
> status: [GenerationStatus](../interfaces.md#max.interfaces.GenerationStatus) = 'active'
### `target_endpoint` {#max.pipelines.core.TextContext.target_endpoint}
> target\_endpoint: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None) = None
### `to_generation_output()` {#max.pipelines.core.TextContext.to_generation_output}
> to\_generation\_output()
Get completion tokens that are ready to be returned to the user.
This method retrieves tokens that have been generated but not yet
delivered to the user, along with their associated log probability data.
**Returns:**
The completion tokens and their associated
log probabilities, if available.
### `tokens` {#max.pipelines.core.TextContext.tokens}
> tokens: [TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer)
### `update()` {#max.pipelines.core.TextContext.update}
> update(new\_token, log\_probabilities=None)
Updates the next\_tokens and extends existing tokens to include all generated tokens.
### `update_with_future_token()` {#max.pipelines.core.TextContext.update_with_future_token}
> update\_with\_future\_token()
Append a placeholder future token to the generated tokens.
This is primarily used for overlap scheduling.
**Return type:**
None
## `reserve_token_space_for_batch()` {#max.pipelines.core.reserve_token_space_for_batch}
> max.pipelines.core.reserve\_token\_space\_for\_batch(batch, num\_tokens)
Temporarily reserves token space for each context in a batch by incrementing
the \_active\_idx and \_end\_idx attributes by num\_tokens for the duration
of the context. These indices are restored to their original values upon exit.
**Parameters:**
* batch – List of TextContext objects to reserve space for.
* num\_tokens – Number of tokens to reserve for each context.
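The reserve-and-restore behavior can be sketched as a context manager; `FakeContext` is a hypothetical stand-in for TextContext's internal indices:

```python
from contextlib import contextmanager


class FakeContext:
    """Hypothetical stand-in exposing TextContext's internal indices."""

    def __init__(self) -> None:
        self._active_idx = 4
        self._end_idx = 4


@contextmanager
def reserve_token_space_for_batch(batch, num_tokens):
    # Temporarily bump each context's indices...
    for ctx in batch:
        ctx._active_idx += num_tokens
        ctx._end_idx += num_tokens
    try:
        yield batch
    finally:
        # ...and restore the original values on exit.
        for ctx in batch:
            ctx._active_idx -= num_tokens
            ctx._end_idx -= num_tokens


ctx = FakeContext()
with reserve_token_space_for_batch([ctx], 8):
    reserved = (ctx._active_idx, ctx._end_idx)  # bumped inside the block
restored = (ctx._active_idx, ctx._end_idx)      # original values after exit
```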
## `validate_aspect_ratio_args()` {#max.pipelines.core.validate_aspect_ratio_args}
> max.pipelines.core.validate\_aspect\_ratio\_args(context)
Validates that required aspect ratio arguments are present for vision input.
**Parameters:**
context ([TextContext](#max.pipelines.core.TextContext) | [TextAndVisionContext](#max.pipelines.core.TextAndVisionContext)) – The context to validate.
**Raises:**
InputError – If required aspect ratio arguments are missing.
**Return type:**
None
## `validate_image_grid_thw_args()` {#max.pipelines.core.validate_image_grid_thw_args}
> max.pipelines.core.validate\_image\_grid\_thw\_args(context)
Validates that image\_grid\_thw is present when vision encoding is needed.
**Parameters:**
context ([TextContext](#max.pipelines.core.TextContext) | [TextAndVisionContext](#max.pipelines.core.TextAndVisionContext)) – The context to validate.
**Raises:**
InputError – If image\_grid\_thw is missing from extra\_model\_args when
vision encoding is needed.
**Return type:**
None
## `validate_image_shape_5d()` {#max.pipelines.core.validate_image_shape_5d}
> max.pipelines.core.validate\_image\_shape\_5d(context)
Validates that images have the expected 5-dimensional shape.
**Parameters:**
context ([TextContext](#max.pipelines.core.TextContext) | [TextAndVisionContext](#max.pipelines.core.TextAndVisionContext)) – The context to validate.
**Raises:**
InputError – If the image shape is not 5-dimensional.
**Return type:**
None
## `validate_initial_prompt_has_image()` {#max.pipelines.core.validate_initial_prompt_has_image}
> max.pipelines.core.validate\_initial\_prompt\_has\_image(context)
Validates that initial prompts contain an image for vision models.
**Parameters:**
context ([TextContext](#max.pipelines.core.TextContext) | [TextAndVisionContext](#max.pipelines.core.TextAndVisionContext)) – The context to validate.
**Raises:**
InputError – If the initial prompt doesn’t contain an image.
**Return type:**
None
## `validate_only_one_image()` {#max.pipelines.core.validate_only_one_image}
> max.pipelines.core.validate\_only\_one\_image(context)
Validates that at most one image is provided in the context.
**Parameters:**
context ([TextContext](#max.pipelines.core.TextContext) | [TextAndVisionContext](#max.pipelines.core.TextAndVisionContext)) – The context to validate.
**Raises:**
InputError – If more than one image is provided.
**Return type:**
None
## `validate_requires_vision_context()` {#max.pipelines.core.validate_requires_vision_context}
> max.pipelines.core.validate\_requires\_vision\_context(context)
Validates that the context is a TextAndVisionContext.
**Parameters:**
context ([TextContext](#max.pipelines.core.TextContext) | [TextAndVisionContext](#max.pipelines.core.TextAndVisionContext)) – The context to validate.
**Raises:**
InputError – If the context is not a TextAndVisionContext.
**Return type:**
None
## `validate_vision_position_ids()` {#max.pipelines.core.validate_vision_position_ids}
> max.pipelines.core.validate\_vision\_position\_ids(context)
Validates that vision\_position\_ids is present when vision encoding is needed.
**Parameters:**
context ([TextContext](#max.pipelines.core.TextContext) | [TextAndVisionContext](#max.pipelines.core.TextAndVisionContext)) – The context to validate.
**Raises:**
InputError – If vision\_position\_ids is missing from extra\_model\_args when
vision encoding is needed.
**Return type:**
None
---
## hf_utils
Utilities for interacting with Hugging Face Files/Repos.
## `HuggingFaceRepo` {#max.pipelines.lib.hf_utils.HuggingFaceRepo}
> class max.pipelines.lib.hf\_utils.HuggingFaceRepo(repo\_id, revision='main', trust\_remote\_code=False, repo\_type=None)
Handle for interacting with a Hugging Face repository (remote or local).
### `encoding_for_file()` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.encoding_for_file}
> encoding\_for\_file(file)
Infers the supported encoding for a given weight file path.
### `file_exists()` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.file_exists}
> file\_exists(filename)
Returns whether the given file exists in the repo.
### `files_for_encoding()` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.files_for_encoding}
> files\_for\_encoding(encoding, weights\_format=None)
Returns paths to weight files for the given encoding (and optionally format).
### `formats_available` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.formats_available}
> property formats\_available: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[WeightsFormat](../graph/weights.md#max.graph.weights.WeightsFormat)]
Returns the weight formats available in this repo.
### `info` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.info}
> property info: ModelInfo
Returns Hugging Face model info (online repos only).
### `repo_id` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.repo_id}
> repo\_id: [str](https://docs.python.org/3/library/stdtypes.html#str)
The Hugging Face repo ID. Despite the name, it can be either a remote Hugging
Face repo ID or a local path.
### `repo_type` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.repo_type}
> repo\_type: RepoType | [None](https://docs.python.org/3/library/constants.html#None) = None
The type of repo. This is inferred from the repo\_id.
### `revision` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.revision}
> revision: [str](https://docs.python.org/3/library/stdtypes.html#str) = 'main'
The revision to use for the repo.
### `size_of()` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.size_of}
> size\_of(filename)
Returns file size in bytes for online repos, or None.
### `supported_encodings` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.supported_encodings}
> property supported\_encodings: [list](https://docs.python.org/3/library/stdtypes.html#list)\[SupportedEncoding]
Returns encodings supported by this repo’s weight files.
### `trust_remote_code` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.trust_remote_code}
> trust\_remote\_code: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether to trust remote code.
### `weight_files` {#max.pipelines.lib.hf_utils.HuggingFaceRepo.weight_files}
> property weight\_files: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[WeightsFormat](../graph/weights.md#max.graph.weights.WeightsFormat), [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]]
Returns weight file paths grouped by format (safetensors, gguf).
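Grouping by format presumably keys off the file extension. A minimal sketch of that idea, not the actual implementation:

```python
from collections import defaultdict


def group_weight_files(filenames: list[str]) -> dict[str, list[str]]:
    """Group weight file paths by format, inferred from the extension."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in filenames:
        if name.endswith(".safetensors"):
            groups["safetensors"].append(name)
        elif name.endswith(".gguf"):
            groups["gguf"].append(name)
    return dict(groups)
```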
## `download_weight_files()` {#max.pipelines.lib.hf_utils.download_weight_files}
> max.pipelines.lib.hf\_utils.download\_weight\_files(huggingface\_model\_id, filenames, revision=None, force\_download=False, max\_workers=8)
Downloads weight files for a Hugging Face model and returns local paths.
**Parameters:**
* huggingface\_model\_id ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The Hugging Face model identifier, e.g. modularai/Llama-3.1-8B-Instruct-GGUF
* filenames ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]) – A list of file paths relative to the root of the Hugging Face repo.
If files provided are available locally, download is skipped, and
the local files are used.
* revision ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – The Hugging Face revision to use. If provided, the local cache is
checked directly, avoiding a network call to Hugging Face.
* force\_download ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to force the files to be re-downloaded, even if they are
already available in the local cache or at a provided path.
* max\_workers ([int](https://docs.python.org/3/library/functions.html#int)) – The number of worker threads used to download files concurrently.
## `generate_local_model_path()` {#max.pipelines.lib.hf_utils.generate_local_model_path}
> max.pipelines.lib.hf\_utils.generate\_local\_model\_path(repo\_id, revision)
Generate the local filesystem path where a Hugging Face model repo is cached.
This function uses Hugging Face’s official snapshot\_download with local\_files\_only=True
to resolve the local cache path for a model repository.
**Parameters:**
* repo\_id ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The Hugging Face repository ID in the format “org/model”
(e.g. “HuggingFaceTB/SmolLM2-135M”)
* revision ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The specific model revision hash to use, typically from a repo lock file
**Returns:**
The absolute path to the cached model files for the specified revision.
**Raises:**
[FileNotFoundError](https://docs.python.org/3/library/exceptions.html#FileNotFoundError) – If the model is not found in the local cache
## `is_diffusion_pipeline()` {#max.pipelines.lib.hf_utils.is_diffusion_pipeline}
> max.pipelines.lib.hf\_utils.is\_diffusion\_pipeline(repo)
Check if a Hugging Face repository is a diffusion pipeline.
Diffusion pipelines typically have a model\_index.json file that describes
the pipeline components.
**Parameters:**
repo ([HuggingFaceRepo](#max.pipelines.lib.hf_utils.HuggingFaceRepo)) – The HuggingFaceRepo to check.
**Returns:**
True if the repository appears to be a diffusion pipeline, False otherwise.
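Based on the description above, the check reduces to testing for model\_index.json. A sketch using a fake repo object (the real function takes a HuggingFaceRepo):

```python
class FakeRepo:
    """Hypothetical stand-in for HuggingFaceRepo's file_exists()."""

    def __init__(self, files: set[str]) -> None:
        self.files = files

    def file_exists(self, filename: str) -> bool:
        return filename in self.files


def is_diffusion_pipeline(repo) -> bool:
    # Diffusion pipelines describe their components in model_index.json.
    return repo.file_exists("model_index.json")
```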
## `try_to_load_from_cache()` {#max.pipelines.lib.hf_utils.try_to_load_from_cache}
> max.pipelines.lib.hf\_utils.try\_to\_load\_from\_cache(repo\_id, filename, revision)
Wrapper around `huggingface_hub.try_to_load_from_cache` that first calls
`validate_hf_repo_access` to ensure the repo exists.
## `validate_hf_repo_access()` {#max.pipelines.lib.hf_utils.validate_hf_repo_access}
> max.pipelines.lib.hf\_utils.validate\_hf\_repo\_access(repo\_id, revision)
Validates repository access and raises clear, user-friendly errors.
Results are cached to avoid redundant Hugging Face API calls when the same
repository is validated multiple times within a process.
**Parameters:**
* repo\_id ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The Hugging Face repository ID to validate
* revision ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The revision/branch to validate
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – With user-friendly error messages for various access issues
**Return type:**
None
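The caching behavior resembles memoization over (repo\_id, revision). A sketch using `functools.lru_cache`, with a call counter standing in for the network request:

```python
from functools import lru_cache

api_calls: list[tuple[str, str]] = []


@lru_cache(maxsize=None)
def validate_repo(repo_id: str, revision: str) -> bool:
    # Hypothetical stand-in for the real Hugging Face API check.
    api_calls.append((repo_id, revision))
    return True


validate_repo("org/model", "main")
validate_repo("org/model", "main")  # served from the cache; no second call
```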
---
## pipelines
The pipelines package provides end-to-end implementations for text
generation, embeddings, audio generation, and speech processing that
automatically convert Hugging Face models into performance-optimized MAX graphs.
Each pipeline can be served via OpenAI-compatible endpoints for production
deployment.
## Modules
* [`architectures`](/max/api/python/pipelines/architectures)
* [`config`](/max/api/python/pipelines/config)
* [`core`](/max/api/python/pipelines/core)
* [`hf_utils`](/max/api/python/pipelines/hf_utils)
* [`interfaces`](/max/api/python/pipelines/interfaces)
* [`lora_config`](/max/api/python/pipelines/lora_config)
* [`model_config`](/max/api/python/pipelines/model_config)
* [`pipeline`](/max/api/python/pipelines/pipeline)
* [`registry`](/max/api/python/pipelines/registry)
* [`sampling`](/max/api/python/pipelines/sampling)
* [`tokenizer`](/max/api/python/pipelines/tokenizer)
---
## interfaces (Pipelines)
Interfaces for MAX pipelines.
## `AlwaysSignalBuffersMixin` {#max.pipelines.lib.interfaces.AlwaysSignalBuffersMixin}
> class max.pipelines.lib.interfaces.AlwaysSignalBuffersMixin
Bases: [`object`](https://docs.python.org/3/library/functions.html#object)
Mixin for models that always require signal buffers.
Use this for models that use VocabParallelEmbedding or other distributed
components that always perform allreduce, even on single-device setups.
Models using this mixin build graphs that always include signal buffer
inputs, regardless of device count. This is typically because they use
distributed embedding layers or other components that call allreduce
operations unconditionally.
### `devices` {#max.pipelines.lib.interfaces.AlwaysSignalBuffersMixin.devices}
> devices: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Device](../driver.md#max.driver.Device)]
Device list that must be provided by the model class.
### `signal_buffers` {#max.pipelines.lib.interfaces.AlwaysSignalBuffersMixin.signal_buffers}
> property signal\_buffers: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Buffer](../driver.md#max.driver.Buffer)]
Override to always create signal buffers.
Models using this mixin have distributed components that always
perform allreduce, even for single-device setups. Therefore,
signal buffers are always required to match the graph inputs.
In compile-only mode (virtual device mode), returns an empty list
to avoid GPU memory allocation which is not supported.
**Returns:**
List of signal buffer tensors, one per device, or empty list
in compile-only mode.
## `ArchConfig` {#max.pipelines.lib.interfaces.ArchConfig}
> class max.pipelines.lib.interfaces.ArchConfig(\*args, \*\*kwargs)
Bases: [`Protocol`](https://docs.python.org/3/library/typing.html#typing.Protocol)
Config for a model architecture.
### `get_max_seq_len()` {#max.pipelines.lib.interfaces.ArchConfig.get_max_seq_len}
> get\_max\_seq\_len()
Returns the default maximum sequence length for the model.
Subclasses should determine whether this value can be overridden by
setting the `--max-length` (`pipeline_config.max_length`) flag.
### `initialize()` {#max.pipelines.lib.interfaces.ArchConfig.initialize}
> classmethod initialize(pipeline\_config)
Initialize the config from a PipelineConfig.
## `ArchConfigWithAttentionKVCache` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache}
> class max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache(dtype, devices=\<factory>, cache\_dtype=None, kv\_cache=\<factory>, data\_parallel\_degree=1, user\_provided\_max\_length=None, huggingface\_config=None, \_kv\_params=None)
Bases: [`ArchConfigWithKVCache`](#max.pipelines.lib.interfaces.ArchConfigWithKVCache), [`ABC`](https://docs.python.org/3/library/abc.html#abc.ABC)
Predefined configuration for architectures that use attention KV cache blocks.
Subclasses must define the following attributes:
* num\_key\_value\_heads: int
* head\_dim: int
* num\_layers: int
* model\_max\_seq\_len: int
### `cache_dtype` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.cache_dtype}
> cache\_dtype: [DType](../dtype.md#max.dtype.DType) | [None](https://docs.python.org/3/library/constants.html#None) = None
The data type to use for the KV cache.
### `data_parallel_degree` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.data_parallel_degree}
> data\_parallel\_degree: [int](https://docs.python.org/3/library/functions.html#int) = 1
The data parallel degree to use when running the model.
### `devices` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.devices}
> devices: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[DeviceRef](../graph/ops.md#max.graph.ops.DeviceRef)]
The physical devices to use when running the model.
### `dtype` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.dtype}
> dtype: [DType](../dtype.md#max.dtype.DType)
The data type to use for the model.
### `get_kv_params()` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.get_kv_params}
> get\_kv\_params()
Returns the KV cache parameters for this architecture.
### `get_max_seq_len()` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.get_max_seq_len}
> get\_max\_seq\_len()
Returns the maximum sequence length the model can process.
Returns `max_length` if set, otherwise `model_max_seq_len`.
Raises ValueError if `max_length` exceeds `model_max_seq_len`.
### `kv_cache` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.kv_cache}
> kv\_cache: KVCacheConfig
The KV cache configuration to use when running the model.
### `model_max_seq_len` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.model_max_seq_len}
> abstract property model\_max\_seq\_len: [int](https://docs.python.org/3/library/functions.html#int)
The maximum sequence length that can be processed by the model.
### `num_key_value_heads` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.num_key_value_heads}
> abstract property num\_key\_value\_heads: [int](https://docs.python.org/3/library/functions.html#int)
Number of key-value heads to use for the KV cache.
### `num_layers` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.num_layers}
> abstract property num\_layers: [int](https://docs.python.org/3/library/functions.html#int)
Number of hidden layers in the model.
### `user_provided_max_length` {#max.pipelines.lib.interfaces.ArchConfigWithAttentionKVCache.user_provided_max_length}
> user\_provided\_max\_length: [int](https://docs.python.org/3/library/functions.html#int) | [None](https://docs.python.org/3/library/constants.html#None) = None
Override for the maximum sequence length.
## `ArchConfigWithKVCache` {#max.pipelines.lib.interfaces.ArchConfigWithKVCache}
> class max.pipelines.lib.interfaces.ArchConfigWithKVCache(\*args, \*\*kwargs)
Bases: [`ArchConfig`](#max.pipelines.lib.interfaces.ArchConfig), [`Protocol`](https://docs.python.org/3/library/typing.html#typing.Protocol)
Config for a model architecture that uses a KV cache.
### `get_kv_params()` {#max.pipelines.lib.interfaces.ArchConfigWithKVCache.get_kv_params}
> get\_kv\_params()
KV cache parameters to use when running the model.
## `ComponentModel` {#max.pipelines.lib.interfaces.ComponentModel}
> class max.pipelines.lib.interfaces.ComponentModel(config, encoding, devices, weights)
Bases: [`ABC`](https://docs.python.org/3/library/abc.html#abc.ABC)
Base interface for component models with weight-backed execution.
## `DiffusionPipeline` {#max.pipelines.lib.interfaces.DiffusionPipeline}
> class max.pipelines.lib.interfaces.DiffusionPipeline(pipeline\_config, session, devices, weight\_paths, \*\*kwargs)
Bases: [`ABC`](https://docs.python.org/3/library/abc.html#abc.ABC)
Base class for diffusion pipelines.
Subclasses must define components mapping component names to ComponentModel types.
## `GenerateMixin` {#max.pipelines.lib.interfaces.GenerateMixin}
> class max.pipelines.lib.interfaces.GenerateMixin(\*args, \*\*kwargs)
Bases: [`Protocol`](https://docs.python.org/3/library/typing.html#typing.Protocol)\[`TextGenerationContextType`, `RequestType`]
Protocol for pipelines that support text generation.
### `execute()` {#max.pipelines.lib.interfaces.GenerateMixin.execute}
> execute(inputs)
Executes the pipeline for the given inputs.
### `generate_async()` {#max.pipelines.lib.interfaces.GenerateMixin.generate_async}
> async generate\_async(prompts)
Generates outputs asynchronously for the given prompts.
## `KVCacheMixin` {#max.pipelines.lib.interfaces.KVCacheMixin}
### `load_kv_managers()` {#max.pipelines.lib.interfaces.KVCacheMixin.load_kv_managers}
> load\_kv\_managers(kv\_params, max\_batch\_size, max\_seq\_len, session, available\_cache\_memory)
Given KV cache parameters and an InferenceSession, loads the KV cache managers.
**Parameters:**
* kv\_params ([KVCacheParamInterface](../nn/legacy/kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheParamInterface)) – KV cache parameters.
* max\_batch\_size ([int](https://docs.python.org/3/library/functions.html#int)) – Maximum batch size of the model.
* max\_seq\_len ([int](https://docs.python.org/3/library/functions.html#int)) – Maximum sequence length of the model.
* session ([InferenceSession](../engine.md#max.engine.InferenceSession)) – Inference session to compile and init the KV cache.
* available\_cache\_memory ([int](https://docs.python.org/3/library/functions.html#int)) – Amount of memory available to the KV cache,
in bytes.
## `ModelInputs` {#max.pipelines.lib.interfaces.ModelInputs}
> class max.pipelines.lib.interfaces.ModelInputs
Bases: [`object`](https://docs.python.org/3/library/functions.html#object)
Base class for model inputs.
Use this class to encapsulate inputs for your model; you may store any
number of dataclass fields.
The following example demonstrates how to create a custom inputs class:
```python
from max.driver import Buffer
from max.dtype import DType
from max.pipelines.lib.interfaces import ModelInputs


class ReplitInputs(ModelInputs):
    tokens: Buffer
    input_row_offsets: Buffer

    def __init__(self, tokens: Buffer, input_row_offsets: Buffer):
        self.tokens = tokens
        self.input_row_offsets = input_row_offsets


tokens = Buffer.zeros((1, 2, 3), DType.int64)
input_row_offsets = Buffer.zeros((1, 1, 1), DType.int64)

# Initialize inputs
inputs = ReplitInputs(tokens=tokens, input_row_offsets=input_row_offsets)

# Access tensors
list(inputs) == [tokens, input_row_offsets]  # Output: True
```
### `hidden_states` {#max.pipelines.lib.interfaces.ModelInputs.hidden_states}
> hidden\_states: [Buffer](../driver.md#max.driver.Buffer) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Buffer](../driver.md#max.driver.Buffer)] | [None](https://docs.python.org/3/library/constants.html#None) = None
Hidden states for a variable number of tokens per sequence.
For data parallel models, this can be a list of Buffers where each Buffer
contains hidden states for the sequences assigned to that device.
### `kv_cache_inputs` {#max.pipelines.lib.interfaces.ModelInputs.kv_cache_inputs}
> kv\_cache\_inputs: KVCacheInputs | [None](https://docs.python.org/3/library/constants.html#None) = None
### `lora_ids` {#max.pipelines.lib.interfaces.ModelInputs.lora_ids}
> lora\_ids: [Buffer](../driver.md#max.driver.Buffer) | [None](https://docs.python.org/3/library/constants.html#None) = None
Buffer containing the LoRA ids.
### `lora_ranks` {#max.pipelines.lib.interfaces.ModelInputs.lora_ranks}
> lora\_ranks: [Buffer](../driver.md#max.driver.Buffer) | [None](https://docs.python.org/3/library/constants.html#None) = None
Buffer containing the LoRA ranks.
### `update()` {#max.pipelines.lib.interfaces.ModelInputs.update}
> update(\*\*kwargs)
Updates attributes from keyword arguments (only existing, non-None).
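The update semantics ("only existing, non-None") can be sketched with a simplified stand-in class, not the actual implementation:

```python
class SketchInputs:
    """Simplified stand-in for ModelInputs' update() semantics."""

    def __init__(self) -> None:
        self.tokens = "t0"
        self.kv_cache_inputs = None

    def update(self, **kwargs) -> None:
        for key, value in kwargs.items():
            # Skip None values and attributes that don't already exist.
            if value is not None and hasattr(self, key):
                setattr(self, key, value)


inputs = SketchInputs()
inputs.update(tokens="t1", kv_cache_inputs=None, bogus=3)
```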
## `ModelOutputs` {#max.pipelines.lib.interfaces.ModelOutputs}
Outputs returned by a model's execute() step.
### `hidden_states` {#max.pipelines.lib.interfaces.ModelOutputs.hidden_states}
> hidden\_states: [Buffer](../driver.md#max.driver.Buffer) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Buffer](../driver.md#max.driver.Buffer)] | [None](https://docs.python.org/3/library/constants.html#None) = None
Hidden states for a variable number of tokens per sequence.
For data parallel models, this can be a list of Buffers where each Buffer
contains hidden states for the sequences assigned to that device.
### `logit_offsets` {#max.pipelines.lib.interfaces.ModelOutputs.logit_offsets}
> logit\_offsets: [Buffer](../driver.md#max.driver.Buffer) | [None](https://docs.python.org/3/library/constants.html#None) = None
Offsets to access variable length logits for each sequence.
### `logits` {#max.pipelines.lib.interfaces.ModelOutputs.logits}
> logits: [Buffer](../driver.md#max.driver.Buffer)
Logits for a variable number of tokens per sequence.
### `next_token_logits` {#max.pipelines.lib.interfaces.ModelOutputs.next_token_logits}
> next\_token\_logits: [Buffer](../driver.md#max.driver.Buffer) | [None](https://docs.python.org/3/library/constants.html#None) = None
Logits for just the next token.
## `PipelineModel` {#max.pipelines.lib.interfaces.PipelineModel}
> class max.pipelines.lib.interfaces.PipelineModel(pipeline\_config, session, huggingface\_config, encoding, devices, kv\_cache\_config, weights, adapter, return\_logits, return\_hidden\_states=ReturnHiddenStates.NONE)
Bases: [`ABC`](https://docs.python.org/3/library/abc.html#abc.ABC), [`Generic`](https://docs.python.org/3/library/typing.html#typing.Generic)\[`BaseContextType`]
A pipeline model with setup, input preparation and execution methods.
### `calculate_max_seq_len()` {#max.pipelines.lib.interfaces.PipelineModel.calculate_max_seq_len}
> abstract classmethod calculate\_max\_seq\_len(pipeline\_config, huggingface\_config)
Calculates the optimal max sequence length for the model.
Models are expected to implement this method. The following example
shows how to implement it for a Mistral model:
```python
class MistralModel(PipelineModel):
    @classmethod
    def calculate_max_seq_len(cls, pipeline_config, huggingface_config) -> int:
        try:
            return upper_bounded_default(
                upper_bound=huggingface_config.max_seq_len,
                default=pipeline_config.max_length,
            )
        except ValueError as e:
            raise ValueError(
                "Unable to infer max_length for Mistral, the provided "
                f"max_length ({pipeline_config.max_length}) exceeds the "
                f"model's max_seq_len ({huggingface_config.max_seq_len})."
            ) from e
```
**Parameters:**
* pipeline\_config ([PipelineConfig](config.md#max.pipelines.lib.config.PipelineConfig)) – Configuration for the pipeline.
* huggingface\_config (AutoConfig) – Hugging Face model configuration.
### `compute_log_probabilities()` {#max.pipelines.lib.interfaces.PipelineModel.compute_log_probabilities}
> compute\_log\_probabilities(session, model\_inputs, model\_outputs, next\_tokens, batch\_top\_n, batch\_echo)
Optional method that can be overridden to compute log probabilities.
**Parameters:**
* session ([InferenceSession](../engine.md#max.engine.InferenceSession)) – Inference session to compute log probabilities within.
* model\_inputs ([ModelInputs](#max.pipelines.lib.interfaces.ModelInputs)) – Inputs to the model returned by
prepare\_\*\_token\_inputs().
* model\_outputs ([ModelOutputs](#max.pipelines.lib.interfaces.ModelOutputs)) – Outputs returned by execute().
* next\_tokens ([Buffer](../driver.md#max.driver.Buffer)) – Sampled tokens. Should have shape=\[batch size]
* batch\_top\_n ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]) – Number of top log probabilities to return per input in
the batch. For any element where top\_n == 0, the
LogProbabilities is skipped.
* batch\_echo ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[bool](https://docs.python.org/3/library/functions.html#bool)]) – Whether to include input tokens in the returned log
probabilities.
### `dtype` {#max.pipelines.lib.interfaces.PipelineModel.dtype}
> property dtype: [DType](../dtype.md#max.dtype.DType)
Returns the model data type (from encoding or pipeline config).
### `estimate_activation_memory()` {#max.pipelines.lib.interfaces.PipelineModel.estimate_activation_memory}
> classmethod estimate\_activation\_memory(pipeline\_config, huggingface\_config)
Estimates the activation memory required for model execution.
This accounts for temporary memory buffers used during model execution,
such as intermediate activations and working buffers.
The default implementation returns 0 for backward compatibility.
Models with significant activation memory requirements should override
this method to provide accurate estimates.
**Parameters:**
* pipeline\_config ([PipelineConfig](config.md#max.pipelines.lib.config.PipelineConfig)) – Pipeline configuration
* huggingface\_config (AutoConfig) – Hugging Face model configuration
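An override typically multiplies the largest working-buffer shape by the element size. A minimal, hypothetical helper (the two-buffer heuristic and all names here are assumptions for illustration, not the MAX implementation):

```python
def estimate_activation_bytes(
    batch_size: int,
    seq_len: int,
    hidden_dim: int,
    dtype_bytes: int,
    n_buffers: int = 2,
) -> int:
    """Crude upper bound: n working buffers of shape [batch, seq, hidden]."""
    return n_buffers * batch_size * seq_len * hidden_dim * dtype_bytes
```

A model's `estimate_activation_memory` override would return a value like this instead of the default 0.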
### `execute()` {#max.pipelines.lib.interfaces.PipelineModel.execute}
> abstract execute(model\_inputs)
Executes the graph with the given inputs.
**Parameters:**
model\_inputs ([ModelInputs](#max.pipelines.lib.interfaces.ModelInputs)) – The model inputs to execute, containing tensors and any other
required data for model execution.
**Returns:**
ModelOutputs containing the pipeline’s output tensors.
This is an abstract method that must be implemented by concrete PipelineModels
to define their specific execution logic.
### `execute_with_capture()` {#max.pipelines.lib.interfaces.PipelineModel.execute_with_capture}
> execute\_with\_capture(model\_inputs, batch\_size)
Executes the model with optional capture handling.
Subclasses can override this to integrate device graph capture/replay.
### `finalize_pipeline_config()` {#max.pipelines.lib.interfaces.PipelineModel.finalize_pipeline_config}
> classmethod finalize\_pipeline\_config(pipeline\_config)
Finalizes the pipeline configuration.
This method is called after the pipeline configuration is resolved.
It can be overridden to perform any finalization steps that are needed.
### `prepare_initial_token_inputs()` {#max.pipelines.lib.interfaces.PipelineModel.prepare_initial_token_inputs}
> abstract prepare\_initial\_token\_inputs(replica\_batches, kv\_cache\_inputs=None, return\_n\_logits=1)
Prepares the initial inputs to be passed to `.execute()`.
The inputs and functionality can vary per model. For example, model
inputs could include encoded tensors, unique IDs per tensor when using
a KV cache manager, and `kv_cache_inputs` (or None if the model does
not use KV cache). This method typically batches encoded tensors,
claims a KV cache slot if needed, and returns the inputs and caches.
### `prepare_next_token_inputs()` {#max.pipelines.lib.interfaces.PipelineModel.prepare_next_token_inputs}
> abstract prepare\_next\_token\_inputs(next\_tokens, prev\_model\_inputs)
Prepares the secondary inputs to be passed to `.execute()`.
While `prepare_initial_token_inputs` manages the initial inputs, this method updates the inputs for each step in a multi-step execution pattern.
### `signal_buffers` {#max.pipelines.lib.interfaces.PipelineModel.signal_buffers}
> property signal\_buffers: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Buffer](../driver.md#max.driver.Buffer)]
Lazily initialize signal buffers for multi-GPU communication collectives.
Signal buffers are only needed during model execution, not during compilation.
By deferring their allocation, we avoid memory allocation in compile-only mode.
**Returns:**
List of signal buffer tensors, one per device for multi-device setups,
or an empty list for single-device setups or compile-only mode.
## `PixelModelInputs` {#max.pipelines.lib.interfaces.PixelModelInputs}
> class max.pipelines.lib.interfaces.PixelModelInputs(\*, tokens, tokens\_2=None, negative\_tokens=None, negative\_tokens\_2=None, extra\_params=\<factory>, timesteps=\<factory>, sigmas=\<factory>, latents=\<factory>, latent\_image\_ids=\<factory>, height=1024, width=1024, num\_inference\_steps=50, guidance\_scale=3.5, guidance=None, true\_cfg\_scale=1.0, num\_warmup\_steps=0, num\_images\_per\_prompt=1)
Bases: [`object`](https://docs.python.org/3/library/functions.html#object)
Common input container for pixel-generation models.
Provides a consistent set of fields used across multiple pixel
pipelines and models.
### `extra_params` {#max.pipelines.lib.interfaces.PixelModelInputs.extra_params}
> extra\_params: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]
A bag of model-specific numeric parameters not represented as explicit fields.
Typical uses:
* Architecture-specific knobs (e.g., cfg\_normalization arrays, scaling vectors)
* Precomputed per-step values not worth standardizing across all models
* Small numeric tensors that are easier to carry as named extras than formal fields
Values are expected to be numpy arrays (ndarray) to keep the contract consistent,
but you can relax this if your codebase needs non-array values.
### `from_context()` {#max.pipelines.lib.interfaces.PixelModelInputs.from_context}
> classmethod from\_context(context)
Build an instance from a context-like dict.
Policy:
* If a key is missing: the dataclass default applies automatically.
* If a key is present with value None: treat as missing and substitute the class default
(including subclass overrides).
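The None-as-missing policy can be sketched with a plain dataclass (a simplified analog for illustration, not the actual implementation):

```python
from dataclasses import dataclass, fields


@dataclass
class Inputs:
    height: int = 1024
    width: int = 1024

    @classmethod
    def from_context(cls, context: dict) -> "Inputs":
        # Keys that are missing OR explicitly None fall back to the class
        # default (including any subclass override of that default).
        kwargs = {
            f.name: context[f.name]
            for f in fields(cls)
            if context.get(f.name) is not None
        }
        return cls(**kwargs)
```

So a context of `{"height": None, "width": 512}` yields `height=1024, width=512`.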
### `guidance` {#max.pipelines.lib.interfaces.PixelModelInputs.guidance}
> guidance: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]] | [None](https://docs.python.org/3/library/constants.html#None) = None
Optional guidance tensor.
* Some pipelines precompute guidance weights/tensors (e.g., per-token weights, per-step weights).
* None is meaningful here: it means “no explicit guidance tensor supplied”.
* Unlike scalar fields, None is preserved (not replaced).
### `guidance_scale` {#max.pipelines.lib.interfaces.PixelModelInputs.guidance_scale}
> guidance\_scale: [float](https://docs.python.org/3/library/functions.html#float) = 3.5
Guidance scale for classifier-free guidance (CFG).
* A higher value typically increases adherence to the prompt but can reduce diversity.
* This is expected to be a real float (not None).
* If a context provides guidance\_scale=None, from\_context() substitutes the default.
### `height` {#max.pipelines.lib.interfaces.PixelModelInputs.height}
> height: [int](https://docs.python.org/3/library/functions.html#int) = 1024
Output height in pixels.
* This is a required scalar (not None).
* If a context provides height=None, from\_context() treats that as “not provided”
and substitutes this default value (or a subclass override).
### `latent_image_ids` {#max.pipelines.lib.interfaces.PixelModelInputs.latent_image_ids}
> latent\_image\_ids: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]]
Optional latent image IDs / positional identifiers for latents.
* Some pipelines attach per-latent identifiers for caching, routing, or conditioning.
* Often used to avoid recomputation of image-id embeddings across steps.
* If unused, it may remain empty.
### `latents` {#max.pipelines.lib.interfaces.PixelModelInputs.latents}
> latents: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]]
Initial latent noise tensor (or initial latent state).
* For diffusion/flow models, this is typically random noise seeded per request.
* Shape depends on model: commonly \[B, C, H/8, W/8] for image latents,
or \[B, T, C, H/8, W/8] for video latents.
* If your pipeline generates latents internally, you may leave it empty.
(Model-specific subclasses can enforce non-empty via \_\_post\_init\_\_.)
### `negative_tokens` {#max.pipelines.lib.interfaces.PixelModelInputs.negative_tokens}
> negative\_tokens: [TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer) | [None](https://docs.python.org/3/library/constants.html#None) = None
Negative prompt tokens for the primary encoder.
Used for classifier-free guidance (CFG) or similar conditioning schemes.
If your pipeline does not use negative prompts, leave as None.
### `negative_tokens_2` {#max.pipelines.lib.interfaces.PixelModelInputs.negative_tokens_2}
> negative\_tokens\_2: [TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer) | [None](https://docs.python.org/3/library/constants.html#None) = None
Negative prompt tokens for the secondary encoder (for dual-encoder models).
If the model is single-encoder or you do not use negative prompts, leave as None.
### `num_images_per_prompt` {#max.pipelines.lib.interfaces.PixelModelInputs.num_images_per_prompt}
> num\_images\_per\_prompt: [int](https://docs.python.org/3/library/functions.html#int) = 1
Number of images/videos to generate per prompt.
* Commonly used for “same prompt, multiple samples” behavior.
* Must be > 0.
* For video generation, the name is kept for historical compatibility.
### `num_inference_steps` {#max.pipelines.lib.interfaces.PixelModelInputs.num_inference_steps}
> num\_inference\_steps: [int](https://docs.python.org/3/library/functions.html#int) = 50
Number of denoising/inference steps.
* This is a required scalar (not None).
* If a context provides num\_inference\_steps=None, from\_context() treats that as
“not provided” and substitutes this default value (or a subclass override).
### `num_warmup_steps` {#max.pipelines.lib.interfaces.PixelModelInputs.num_warmup_steps}
> num\_warmup\_steps: [int](https://docs.python.org/3/library/functions.html#int) = 0
Number of warmup steps.
* Used in some schedulers/pipelines to handle initial steps differently
(e.g., scheduler stabilization, cache warmup, etc.).
* Must be >= 0.
### `sigmas` {#max.pipelines.lib.interfaces.PixelModelInputs.sigmas}
> sigmas: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]]
Precomputed sigma schedule for denoising.
* Usually a 1D float32 numpy array of length num\_inference\_steps
corresponding to the noise level per step.
* Some schedulers are sigma-based; others are timestep-based; some use both.
* If unused, it may remain empty unless your model subclass requires it.
### `timesteps` {#max.pipelines.lib.interfaces.PixelModelInputs.timesteps}
> timesteps: [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[float32]]
Precomputed denoising timestep schedule.
* Usually a 1D float32 numpy array of length num\_inference\_steps
(exact semantics depend on your scheduler).
* If your pipeline precomputes the scheduler trajectory, you pass it here.
* Some models may not require explicit timesteps; in that case it may remain empty.
(Model-specific subclasses can enforce non-empty via \_\_post\_init\_\_.)
### `tokens` {#max.pipelines.lib.interfaces.PixelModelInputs.tokens}
> tokens: [TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer)
Primary encoder token buffer.
This is the main prompt representation consumed by the model’s text encoder.
Required for all models.
### `tokens_2` {#max.pipelines.lib.interfaces.PixelModelInputs.tokens_2}
> tokens\_2: [TokenBuffer](../interfaces.md#max.interfaces.TokenBuffer) | [None](https://docs.python.org/3/library/constants.html#None) = None
Secondary encoder token buffer (for dual-encoder models).
Examples: architectures that have a second text encoder stream or pooled embeddings.
If the model is single-encoder, leave as None.
### `true_cfg_scale` {#max.pipelines.lib.interfaces.PixelModelInputs.true_cfg_scale}
> true\_cfg\_scale: [float](https://docs.python.org/3/library/functions.html#float) = 1.0
“True CFG” scale used by certain pipelines/models.
* Some architectures distinguish between the user-facing guidance\_scale and an internal
scale applied to a different normalization or conditioning pathway.
* Defaults to 1.0 for pipelines that do not use this feature.
### `width` {#max.pipelines.lib.interfaces.PixelModelInputs.width}
> width: [int](https://docs.python.org/3/library/functions.html#int) = 1024
Output width in pixels.
* This is a required scalar (not None).
* If a context provides width=None, from\_context() treats that as “not provided”
and substitutes this default value (or a subclass override).
---
## log_probabilities
## `compute_log_probabilities_ragged()` {#max.pipelines.lib.log_probabilities.compute_log_probabilities_ragged}
> max.pipelines.lib.log\_probabilities.compute\_log\_probabilities\_ragged(device, model, \*, input\_row\_offsets, logits, next\_token\_logits, tokens, sampled\_tokens, batch\_top\_n, batch\_echo)
Computes the log probabilities for ragged model outputs.
**Parameters:**
* device ([Device](../driver.md#max.driver.Device)) – Device on which to do the bulk of the log probabilities
computation. A small amount of computation still occurs on the
host regardless of this setting.
* model ([Model](../engine.md#max.engine.Model)) – A compiled version of a graph from the
‘log\_probabilities\_ragged\_graph’ function.
* input\_row\_offsets ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]) – Token offsets into token-indexed buffers, by batch
index. Should have 1 more element than there are batches (batch n
is token indices \[input\_row\_offsets\[n], input\_row\_offsets\[n+1])).
* logits ([Buffer](../driver.md#max.driver.Buffer) | None) – (tokens, vocab\_dim) tensor full of tensor logits. Token
dimension mapped to batches using input\_row\_offsets. May be
omitted only if all ‘batch\_echo’ values are False.
* next\_token\_logits ([Buffer](../driver.md#max.driver.Buffer)) – (batch\_dim, vocab\_dim) tensor full of tensor logits
for the next token in each batch item.
* tokens ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]) – (total\_tokens,) flat token array for the batch; indices
per batch given by input\_row\_offsets.
* sampled\_tokens ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]]) – (batch\_dim,) tensor of sampled token per batch
* batch\_top\_n ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[int](https://docs.python.org/3/library/functions.html#int)]) – Number of top log probabilities to return per input in
the batch. For any element where top\_n == 0, the
LogProbabilities is skipped.
* batch\_echo ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[bool](https://docs.python.org/3/library/functions.html#bool)]) – Whether to include input tokens in the returned log
probabilities.
**Returns:**
Computed log probabilities for each item in the batch.
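The ragged layout described for `input_row_offsets` can be illustrated with plain NumPy (illustrative values only):

```python
import numpy as np

# Two batch items: batch 0 owns tokens[0:3], batch 1 owns tokens[3:5].
# Note the offsets array has one more element than there are batches.
input_row_offsets = np.array([0, 3, 5])
tokens = np.array([11, 12, 13, 21, 22])

batches = [
    tokens[input_row_offsets[n] : input_row_offsets[n + 1]]
    for n in range(len(input_row_offsets) - 1)
]
```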
## `log_probabilities_ragged_graph()` {#max.pipelines.lib.log_probabilities.log_probabilities_ragged_graph}
> max.pipelines.lib.log\_probabilities.log\_probabilities\_ragged\_graph(device, \*, levels)
Create a graph to compute log probabilities over ragged inputs.
A model obtained by this graph is a required input to
‘compute\_log\_probabilities\_ragged’.
**Parameters:**
* device ([DeviceRef](../graph/type.md#max.graph.type.DeviceRef)) – The type of device this graph will need to run on.
* levels ([int](https://docs.python.org/3/library/functions.html#int)) – log2(max\_k+1) for the desired maximum top-k you’d like to
support. To support the OpenAI API maximum of 5 logprobs, use
levels=3. Higher levels can be used to support higher k.
**Return type:**
[Graph](../graph/Graph.md#max.graph.Graph)
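Picking `levels` for a desired maximum top-k follows directly from the formula above (a small helper, not part of the API):

```python
import math


def levels_for_top_k(max_k: int) -> int:
    # levels = log2(max_k + 1), rounded up to a whole number of levels.
    return math.ceil(math.log2(max_k + 1))
```

For example, the OpenAI API maximum of 5 logprobs needs `levels=3`, which also covers any k up to 7.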
---
## lora_config
MAX LoRA configuration.
## `LoRAConfig` {#max.pipelines.lib.lora_config.LoRAConfig}
> class max.pipelines.lib.lora\_config.LoRAConfig(\*, config\_file=None, section\_name=None, enable\_lora=False, lora\_paths=\<factory>, max\_lora\_rank=16, max\_num\_loras=1)
### `enable_lora` {#max.pipelines.lib.lora_config.LoRAConfig.enable_lora}
> enable\_lora: [bool](https://docs.python.org/3/library/functions.html#bool)
### `lora_paths` {#max.pipelines.lib.lora_config.LoRAConfig.lora_paths}
> lora\_paths: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]
### `max_lora_rank` {#max.pipelines.lib.lora_config.LoRAConfig.max_lora_rank}
> max\_lora\_rank: [int](https://docs.python.org/3/library/functions.html#int)
### `max_num_loras` {#max.pipelines.lib.lora_config.LoRAConfig.max_num_loras}
> max\_num\_loras: [int](https://docs.python.org/3/library/functions.html#int)
### `model_config` {#max.pipelines.lib.lora_config.LoRAConfig.model_config}
> model\_config: ClassVar\[ConfigDict] = {}
Configuration for the model, should be a dictionary conforming to pydantic's `ConfigDict`.
### `model_post_init()` {#max.pipelines.lib.lora_config.LoRAConfig.model_post_init}
> model\_post\_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
**Parameters:**
* self (BaseModel) – The BaseModel instance.
* context (Any) – The context.
---
## model_config
## `MAXModelConfig` {#max.pipelines.lib.model_config.MAXModelConfig}
### `allow_safetensors_weights_fp32_bf6_bidirectional_cast` {#max.pipelines.lib.model_config.MAXModelConfig.allow_safetensors_weights_fp32_bf6_bidirectional_cast}
> allow\_safetensors\_weights\_fp32\_bf6\_bidirectional\_cast: [bool](https://docs.python.org/3/library/functions.html#bool)
### `create_kv_cache_config()` {#max.pipelines.lib.model_config.MAXModelConfig.create_kv_cache_config}
> create\_kv\_cache\_config(\*\*kv\_cache\_kwargs)
Create and set the KV cache configuration with the given parameters.
This method creates a new KVCacheConfig from the provided keyword arguments
and automatically sets the cache\_dtype based on the model’s quantization
encoding (or any explicit override in kv\_cache\_kwargs).
**Parameters:**
\*\*kv\_cache\_kwargs – Keyword arguments to pass to KVCacheConfig constructor.
Common options include:
* cache\_strategy: The KV cache strategy (continuous, paged, etc.)
* kv\_cache\_page\_size: Number of tokens per page for paged cache
* enable\_prefix\_caching: Whether to enable prefix caching
* device\_memory\_utilization: Fraction of device memory to use
* cache\_dtype: Override for the cache data type
**Return type:**
None
### `data_parallel_degree` {#max.pipelines.lib.model_config.MAXModelConfig.data_parallel_degree}
> data\_parallel\_degree: [int](https://docs.python.org/3/library/functions.html#int)
### `default_device_spec` {#max.pipelines.lib.model_config.MAXModelConfig.default_device_spec}
> property default\_device\_spec: [DeviceSpec](../driver.md#max.driver.DeviceSpec)
Returns the default device spec for the model.
This is the first device spec in the list, used for device spec checks
throughout config validation.
**Returns:**
The default device spec for the model.
### `device_specs` {#max.pipelines.lib.model_config.MAXModelConfig.device_specs}
> device\_specs: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[DeviceSpec](../driver.md#max.driver.DeviceSpec)]
### `diffusers_config` {#max.pipelines.lib.model_config.MAXModelConfig.diffusers_config}
> property diffusers\_config: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [None](https://docs.python.org/3/library/constants.html#None)
Retrieve the diffusers config for diffusion pipelines.
Note: For multiprocessing, \_\_getstate\_\_ clears \_diffusers\_config
before pickling. Each worker process will reload the config fresh.
**Returns:**
The diffusers config dict if this is a diffusion pipeline, None otherwise.
The dict will have a structure with “\_class\_name” and “components” keys,
where each component includes “class\_name” and “config\_dict” fields.
### `force_download` {#max.pipelines.lib.model_config.MAXModelConfig.force_download}
> force\_download: [bool](https://docs.python.org/3/library/functions.html#bool)
### `generation_config` {#max.pipelines.lib.model_config.MAXModelConfig.generation_config}
> property generation\_config: GenerationConfig
Retrieve the Hugging Face GenerationConfig for this model.
This property lazily loads the GenerationConfig from the model repository
and caches it to avoid repeated remote fetches.
**Returns:**
The GenerationConfig for the model, containing generation parameters
like max\_length, temperature, top\_p, etc. If loading fails, returns
a default GenerationConfig.
### `graph_quantization_encoding` {#max.pipelines.lib.model_config.MAXModelConfig.graph_quantization_encoding}
> property graph\_quantization\_encoding: [QuantizationEncoding](../graph/quantization.md#max.graph.quantization.QuantizationEncoding) | [None](https://docs.python.org/3/library/constants.html#None)
Converts the CLI encoding to a MAX Graph quantization encoding.
**Returns:**
The graph quantization encoding corresponding to the CLI encoding.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If no CLI encoding was specified.
### `huggingface_config` {#max.pipelines.lib.model_config.MAXModelConfig.huggingface_config}
> property huggingface\_config: AutoConfig | [None](https://docs.python.org/3/library/constants.html#None)
Returns the Hugging Face model config (loaded on first access).
### `huggingface_model_repo` {#max.pipelines.lib.model_config.MAXModelConfig.huggingface_model_repo}
> property huggingface\_model\_repo: [HuggingFaceRepo](hf_utils.md#max.pipelines.lib.hf_utils.HuggingFaceRepo)
Returns the Hugging Face repo handle for the model.
### `huggingface_model_revision` {#max.pipelines.lib.model_config.MAXModelConfig.huggingface_model_revision}
> huggingface\_model\_revision: [str](https://docs.python.org/3/library/stdtypes.html#str)
### `huggingface_weight_repo` {#max.pipelines.lib.model_config.MAXModelConfig.huggingface_weight_repo}
> property huggingface\_weight\_repo: [HuggingFaceRepo](hf_utils.md#max.pipelines.lib.hf_utils.HuggingFaceRepo)
Returns the Hugging Face repo handle for weight files.
### `huggingface_weight_repo_id` {#max.pipelines.lib.model_config.MAXModelConfig.huggingface_weight_repo_id}
> property huggingface\_weight\_repo\_id: [str](https://docs.python.org/3/library/stdtypes.html#str)
Returns the Hugging Face repo ID used for weight files.
### `huggingface_weight_revision` {#max.pipelines.lib.model_config.MAXModelConfig.huggingface_weight_revision}
> huggingface\_weight\_revision: [str](https://docs.python.org/3/library/stdtypes.html#str)
### `kv_cache` {#max.pipelines.lib.model_config.MAXModelConfig.kv_cache}
> kv\_cache: KVCacheConfig
### `model_config` {#max.pipelines.lib.model_config.MAXModelConfig.model_config}
> model\_config: ClassVar\[ConfigDict] = {'arbitrary\_types\_allowed': True}
Configuration for the model, should be a dictionary conforming to pydantic's `ConfigDict`.
### `model_name` {#max.pipelines.lib.model_config.MAXModelConfig.model_name}
> property model\_name: [str](https://docs.python.org/3/library/stdtypes.html#str)
Returns the served model name or model path.
### `model_path` {#max.pipelines.lib.model_config.MAXModelConfig.model_path}
> model\_path: [str](https://docs.python.org/3/library/stdtypes.html#str)
### `model_post_init()` {#max.pipelines.lib.model_config.MAXModelConfig.model_post_init}
> model\_post\_init(context, /)
This function is meant to behave like a BaseModel method to initialise private attributes.
It takes context as an argument since that’s what pydantic-core passes when calling it.
**Parameters:**
* self (BaseModel) – The BaseModel instance.
* context (Any) – The context.
**Return type:**
None
### `quantization_encoding` {#max.pipelines.lib.model_config.MAXModelConfig.quantization_encoding}
> quantization\_encoding: SupportedEncoding | [None](https://docs.python.org/3/library/constants.html#None)
### `resolve()` {#max.pipelines.lib.model_config.MAXModelConfig.resolve}
> resolve()
Validates and resolves the config.
This method is called after the model config is initialized, to ensure that all
config fields have been initialized to a valid state. It will also set
and update other fields which may not be determined / initialized in the
default factory.
In order:
1. Validate that the device\_specs provided are available
2. Parse the weight path(s) and initialize the \_weights\_repo\_id
**Return type:**
None
### `rope_type` {#max.pipelines.lib.model_config.MAXModelConfig.rope_type}
> rope\_type: RopeType | [None](https://docs.python.org/3/library/constants.html#None)
### `sampling_params_defaults` {#max.pipelines.lib.model_config.MAXModelConfig.sampling_params_defaults}
> property sampling\_params\_defaults: [SamplingParamsGenerationConfigDefaults](../interfaces.md#max.interfaces.SamplingParamsGenerationConfigDefaults)
Returns sampling defaults derived from the generation config.
### `served_model_name` {#max.pipelines.lib.model_config.MAXModelConfig.served_model_name}
> served\_model\_name: [str](https://docs.python.org/3/library/stdtypes.html#str) | [None](https://docs.python.org/3/library/constants.html#None)
### `set_cache_dtype_given_quantization_encoding()` {#max.pipelines.lib.model_config.MAXModelConfig.set_cache_dtype_given_quantization_encoding}
> set\_cache\_dtype\_given\_quantization\_encoding()
Determine the KV cache dtype based on quantization encoding configuration.
The dtype is determined in the following priority order:
1. Explicit override from kv\_cache.kv\_cache\_format (if set)
2. Derived from the model’s quantization\_encoding
3. Falls back to float32 if no encoding is specified
**Returns:**
The DType to use for the KV cache. Typical values are:
* DType.float32 for float32, q4\_k, q4\_0, q6\_k encodings
* DType.bfloat16 for bfloat16, float8\_e4m3fn, float4\_e2m1fnx2, gptq encodings
**Return type:**
[DType](../dtype.md#max.dtype.DType)
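The priority order can be sketched as a small resolver (string-keyed for illustration; the real method works with `DType` values and config objects):

```python
def resolve_cache_dtype(explicit_override, quantization_encoding):
    # 1. An explicit kv_cache_format override wins.
    if explicit_override is not None:
        return explicit_override
    # 2. Otherwise derive the dtype from the quantization encoding.
    derived = {
        "float32": "float32", "q4_k": "float32",
        "q4_0": "float32", "q6_k": "float32",
        "bfloat16": "bfloat16", "float8_e4m3fn": "bfloat16",
        "float4_e2m1fnx2": "bfloat16", "gptq": "bfloat16",
    }
    if quantization_encoding in derived:
        return derived[quantization_encoding]
    # 3. Fall back to float32 when no encoding is specified.
    return "float32"
```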
### `trust_remote_code` {#max.pipelines.lib.model_config.MAXModelConfig.trust_remote_code}
> trust\_remote\_code: [bool](https://docs.python.org/3/library/functions.html#bool)
### `use_subgraphs` {#max.pipelines.lib.model_config.MAXModelConfig.use_subgraphs}
> use\_subgraphs: [bool](https://docs.python.org/3/library/functions.html#bool)
### `validate_and_resolve_quantization_encoding_weight_path()` {#max.pipelines.lib.model_config.MAXModelConfig.validate_and_resolve_quantization_encoding_weight_path}
> validate\_and\_resolve\_quantization\_encoding\_weight\_path(default\_encoding)
Verifies that the quantization encoding and weight path are consistent.
**Parameters:**
default\_encoding (SupportedEncoding) – The default encoding to use if no encoding is provided.
**Return type:**
None
### `validate_and_resolve_rope_type()` {#max.pipelines.lib.model_config.MAXModelConfig.validate_and_resolve_rope_type}
> validate\_and\_resolve\_rope\_type(arch\_rope\_type)
Resolves rope\_type from architecture default if not set.
**Parameters:**
arch\_rope\_type (RopeType)
**Return type:**
None
### `validate_and_resolve_with_resolved_quantization_encoding()` {#max.pipelines.lib.model_config.MAXModelConfig.validate_and_resolve_with_resolved_quantization_encoding}
> validate\_and\_resolve\_with\_resolved\_quantization\_encoding(supported\_encodings, default\_weights\_format)
Validates model path and weight path against resolved quantization encoding.
Also resolves the KV cache strategy and finalizes the encoding config.
**Parameters:**
* supported\_encodings ([dict](https://docs.python.org/3/library/stdtypes.html#dict)\[SupportedEncoding, [list](https://docs.python.org/3/library/stdtypes.html#list)\[[KVCacheStrategy](../nn/legacy/kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheStrategy)]]) – A dictionary of supported encodings and their corresponding KV cache strategies.
* default\_weights\_format ([WeightsFormat](../graph/weights.md#max.graph.weights.WeightsFormat)) – The default weights format to use if no weights format is provided.
**Return type:**
None
### `validate_lora_compatibility()` {#max.pipelines.lib.model_config.MAXModelConfig.validate_lora_compatibility}
> validate\_lora\_compatibility()
Validates that LoRA configuration is compatible with model settings.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If LoRA is enabled but incompatible with current model configuration.
**Return type:**
None
### `validate_multi_gpu_supported()` {#max.pipelines.lib.model_config.MAXModelConfig.validate_multi_gpu_supported}
> validate\_multi\_gpu\_supported(multi\_gpu\_supported)
Validates that the model architecture supports multi-GPU inference.
**Parameters:**
multi\_gpu\_supported ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether the model architecture supports multi-GPU inference.
**Return type:**
None
### `vision_config_overrides` {#max.pipelines.lib.model_config.MAXModelConfig.vision_config_overrides}
> vision\_config\_overrides: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), Any]
### `weight_path` {#max.pipelines.lib.model_config.MAXModelConfig.weight_path}
> weight\_path: [list](https://docs.python.org/3/library/stdtypes.html#list)\[Path]
### `weights_size()` {#max.pipelines.lib.model_config.MAXModelConfig.weights_size}
> weights\_size()
Calculates the total size in bytes of all weight files in `weight_path`.
Attempts to find the weights locally first to avoid network
calls, checking in the following order:
1. If repo\_type is `RepoType.local`, it checks if the path
in weight\_path exists directly as a local file path.
2. Otherwise, if repo\_type is `RepoType.online`, it first checks the local
Hugging Face cache using `huggingface_hub.try_to_load_from_cache()`.
If not found in the cache, it falls back to querying the Hugging Face
Hub API via `HuggingFaceRepo.size_of()`.
**Returns:**
The total size of all weight files in bytes.
**Raises:**
* [FileNotFoundError](https://docs.python.org/3/library/exceptions.html#FileNotFoundError) – If repo\_type is `RepoType.local` and a file
specified in weight\_path is not found within the local repo
directory.
* [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If `HuggingFaceRepo.size_of()` fails to retrieve the
file size from the Hugging Face Hub API (e.g., file metadata
not available or API error).
* [RuntimeError](https://docs.python.org/3/library/exceptions.html#RuntimeError) – If the determined repo\_type is unexpected.
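The local-first lookup order used by `weights_size()` can be sketched as follows. This is an illustrative simplification; `check_hf_cache` and `query_hub_api` are hypothetical stand-ins for `huggingface_hub.try_to_load_from_cache()` and `HuggingFaceRepo.size_of()`:

```python
# Illustrative sketch of the local-first weight size lookup described above;
# check_hf_cache and query_hub_api are hypothetical stand-ins.
import os

def total_weights_size(weight_paths, repo_type, local_root=".",
                       check_hf_cache=lambda p: None,
                       query_hub_api=lambda p: 0):
    total = 0
    for path in weight_paths:
        if repo_type == "local":
            # 1. Local repos: the file must exist directly on disk.
            full = os.path.join(local_root, path)
            if not os.path.exists(full):
                raise FileNotFoundError(full)
            total += os.path.getsize(full)
        elif repo_type == "online":
            # 2. Online repos: try the local Hugging Face cache first...
            cached = check_hf_cache(path)
            if cached is not None:
                total += os.path.getsize(cached)
            else:
                # ...and only fall back to a Hub API call if uncached.
                total += query_hub_api(path)
        else:
            raise RuntimeError(f"unexpected repo_type: {repo_type}")
    return total
```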
## `MAXModelConfigBase` {#max.pipelines.lib.model_config.MAXModelConfigBase}
> class max.pipelines.lib.model\_config.MAXModelConfigBase(\*, config\_file=None, section\_name=None)
Bases: `ConfigFileModel`
Abstract base class for all (required) MAX model configs.
This base class is used to configure a model for a pipeline, and is also
handy for sidestepping the need to pass optional fields when subclassing
MAXModelConfig.
### `model_config` {#max.pipelines.lib.model_config.MAXModelConfigBase.model_config}
> model\_config: ClassVar\[ConfigDict] = {'arbitrary\_types\_allowed': True}
Configuration for the model; should be a dictionary conforming to pydantic's `ConfigDict`.
---
## pipeline
MAX pipeline for model inference and generation (Text Generation variant).
## `BatchInfo` {#max.pipelines.lib.pipeline_variants.text_generation.BatchInfo}
> class max.pipelines.lib.pipeline\_variants.text\_generation.BatchInfo(past\_seq\_lens, seq\_lens, num\_steps)
Information about a batch of requests passed to the pipeline.
### `num_steps` {#max.pipelines.lib.pipeline_variants.text_generation.BatchInfo.num_steps}
> num\_steps: [int](https://docs.python.org/3/library/functions.html#int)
Number of steps to run in the pipeline
### `past_seq_lens` {#max.pipelines.lib.pipeline_variants.text_generation.BatchInfo.past_seq_lens}
> past\_seq\_lens: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
Coordinated list of past sequence lengths (i.e. context lengths)
### `seq_lens` {#max.pipelines.lib.pipeline_variants.text_generation.BatchInfo.seq_lens}
> seq\_lens: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]
Coordinated list of sequence lengths, i.e. prompt\_len or 1
## `TextGenerationPipeline` {#max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline}
> class max.pipelines.lib.pipeline\_variants.text\_generation.TextGenerationPipeline(pipeline\_config, pipeline\_model, eos\_token\_id, weight\_adapters, tokenizer)
Generalized token generator pipeline.
### `execute()` {#max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline.execute}
> execute(inputs)
Processes the batch and returns decoded tokens.
Given a batch, executes the graph for num\_steps in a multi-step
scenario, then decodes the tokens and returns the list of decoded
tokens.
### `kv_managers` {#max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline.kv_managers}
> property kv\_managers: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]
Return the list of KV cache managers backing this pipeline.
### `pipeline_config` {#max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline.pipeline_config}
> property pipeline\_config: [PipelineConfig](config.md#max.pipelines.lib.config.PipelineConfig)
Return the pipeline configuration.
### `prepare_batch()` {#max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline.prepare_batch}
> prepare\_batch(batches, num\_steps)
Prepare model inputs and ancillary state for multi-step execution.
This flattens replica batches, optionally initializes constrained
decoding bitmasks, ensures KV-cache reservations, clamps `num_steps`
per context, and builds initial model inputs.
**Parameters:**
* batches ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[list](https://docs.python.org/3/library/stdtypes.html#list)\[TextGenerationContextType]]) – Per-replica list of contexts.
* num\_steps ([int](https://docs.python.org/3/library/functions.html#int)) – Desired number of steps to run.
**Returns:**
* ModelInputs: Prepared inputs for the first step.
* int: The clamped number of steps to run.
* Optional\[np.ndarray]: The structured decoding bitmask or None.
* list\[TextGenerationContextType]: The flattened context batch.
**Return type:**
A tuple of `(ModelInputs, int, Optional[np.ndarray], list[TextGenerationContextType])`.
### `release()` {#max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline.release}
> release(request\_id)
Mark the context as complete, releasing the cache slot from the KV manager.
Note: KV cache lifecycle is now managed by the scheduler. This method
is kept for interface compatibility but is a no-op for regular pipelines.
### `tokenizer` {#max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline.tokenizer}
> property tokenizer: [PipelineTokenizer](../interfaces.md#max.interfaces.PipelineTokenizer)\[TextGenerationContextType, [ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[[integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]], [TextGenerationRequest](../interfaces.md#max.interfaces.TextGenerationRequest)]
Return the tokenizer used for building contexts and decoding.
### `update_for_structured_output()` {#max.pipelines.lib.pipeline_variants.text_generation.TextGenerationPipeline.update_for_structured_output}
> update\_for\_structured\_output(context, bitmask, index)
Update context and logits bitmask for structured output.
If a `json_schema` is present and no matcher is set, this compiles a
grammar matcher and installs it on the context. It may also jump ahead in
generation and fill the per-request token bitmask used to constrain the
next-token distribution.
**Parameters:**
* context (TextGenerationContextType) – Request context to update.
* bitmask ([ndarray](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html#numpy.ndarray)\[[tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), ...], [dtype](https://numpy.org/doc/stable/reference/generated/numpy.dtype.html#numpy.dtype)\[int32]]) – Optional preallocated bitmask buffer; updated in-place.
* index ([int](https://docs.python.org/3/library/functions.html#int)) – Global position into the bitmask for this request.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If a JSON schema is provided but structured output is not
enabled via sampling configuration.
**Return type:**
None
## `StandaloneSpeculativeDecodingPipeline` {#max.pipelines.lib.speculative_decoding.StandaloneSpeculativeDecodingPipeline}
> final class max.pipelines.lib.speculative\_decoding.StandaloneSpeculativeDecodingPipeline(pipeline\_config, pipeline\_model, eos\_token\_id, weight\_adapters, tokenizer, draft\_pipeline\_model=None, draft\_weight\_adapters=None)
Bases: `SpeculativeDecodingPipelineBase`
Standalone speculative decoding where draft model runs independently.
In this approach, the draft model generates tokens without any information
from the target model, then the target model verifies these tokens.
### `generate_draft_tokens()` {#max.pipelines.lib.speculative_decoding.StandaloneSpeculativeDecodingPipeline.generate_draft_tokens}
> generate\_draft\_tokens(batch, num\_steps, model\_inputs)
Generates draft tokens for the batch using the draft model.
## `EmbeddingsPipeline` {#max.pipelines.lib.embeddings_pipeline.EmbeddingsPipeline}
### `execute()` {#max.pipelines.lib.embeddings_pipeline.EmbeddingsPipeline.execute}
> execute(inputs)
Processes the batch and returns embeddings.
Given a batch, executes the graph and returns the list of embedding
outputs per request.
### `release()` {#max.pipelines.lib.embeddings_pipeline.EmbeddingsPipeline.release}
> release(request\_id)
Releases resources for the request (no-op for embeddings).
---
## registry (Pipelines)
Model registry, for tracking various model variants.
## `PipelineRegistry` {#max.pipelines.lib.registry.PipelineRegistry}
> class max.pipelines.lib.registry.PipelineRegistry(architectures)
Registry for managing supported model architectures and their pipelines.
This class maintains a collection of [`SupportedArchitecture`](#max.pipelines.lib.registry.SupportedArchitecture)
instances, each defining how a particular model architecture should be
loaded, configured, and executed.
:::note Note
Do not instantiate this class directly. Always use the global
[`PIPELINE_REGISTRY`](#max.pipelines.lib.registry.PIPELINE_REGISTRY) singleton, which is automatically populated
with all built-in architectures when you import `max.pipelines`.
:::
Use [`PIPELINE_REGISTRY`](#max.pipelines.lib.registry.PIPELINE_REGISTRY) when you want to:
* **Register a custom architecture**: Call [`register()`](#max.pipelines.lib.registry.PipelineRegistry.register) to add a new
MAX model architecture to the registry before loading it.
* **Query supported models**: Call [`retrieve_architecture()`](#max.pipelines.lib.registry.PipelineRegistry.retrieve_architecture) to check
if a Hugging Face model repository is supported before attempting to load it.
* **Access cached configs**: Methods like [`get_active_huggingface_config()`](#max.pipelines.lib.registry.PipelineRegistry.get_active_huggingface_config) and
[`get_active_tokenizer()`](#max.pipelines.lib.registry.PipelineRegistry.get_active_tokenizer) provide cached access to model configurations and tokenizers.
### `get_active_diffusers_config()` {#max.pipelines.lib.registry.PipelineRegistry.get_active_diffusers_config}
> get\_active\_diffusers\_config(huggingface\_repo)
Retrieves or creates a cached diffusers config for the given repository.
This method checks if the repository is a diffusion pipeline by looking for
model\_index.json. If found, it downloads and caches the config. If not found,
returns None.
**Parameters:**
huggingface\_repo ([HuggingFaceRepo](hf_utils.md#max.pipelines.lib.hf_utils.HuggingFaceRepo)) – The HuggingFaceRepo containing the model.
**Returns:**
The diffusers config dict if this is a diffusion pipeline, None otherwise.
### `get_active_huggingface_config()` {#max.pipelines.lib.registry.PipelineRegistry.get_active_huggingface_config}
> get\_active\_huggingface\_config(huggingface\_repo)
Retrieves or creates a cached Hugging Face AutoConfig for the given model.
Maintains a cache of Hugging Face configurations to avoid reloading them
unnecessarily, which would incur a Hugging Face Hub API call.
If a config for the given model hasn’t been loaded before, it will
create a new one using AutoConfig.from\_pretrained() with the model’s
settings.
Note: The cache key (HuggingFaceRepo) includes trust\_remote\_code in its
hash, so configs with different trust settings are cached separately.
For multiprocessing, each worker process has its own registry instance
with an empty cache, so configs are loaded fresh in each worker.
**Parameters:**
huggingface\_repo ([HuggingFaceRepo](hf_utils.md#max.pipelines.lib.hf_utils.HuggingFaceRepo)) – The HuggingFaceRepo containing the model.
**Returns:**
The Hugging Face configuration object for the model.
**Return type:**
AutoConfig
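The caching behavior, including the trust setting in the cache key, can be sketched as follows. `ConfigCache` is a hypothetical simplification, not the registry's actual implementation:

```python
# Hypothetical sketch of a config cache keyed by (repo_id, trust_remote_code),
# mirroring the caching behavior described above (not the real registry code).
class ConfigCache:
    def __init__(self, loader):
        self._cache = {}
        self._loader = loader  # stand-in for AutoConfig.from_pretrained
        self.loads = 0

    def get(self, repo_id, trust_remote_code=False):
        # The trust setting is part of the key, so configs loaded with
        # different trust settings are cached separately.
        key = (repo_id, trust_remote_code)
        if key not in self._cache:
            self.loads += 1
            self._cache[key] = self._loader(repo_id, trust_remote_code)
        return self._cache[key]
```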
### `get_active_tokenizer()` {#max.pipelines.lib.registry.PipelineRegistry.get_active_tokenizer}
> get\_active\_tokenizer(huggingface\_repo)
Retrieves or creates a cached Hugging Face AutoTokenizer for the given model.
Maintains a cache of Hugging Face tokenizers to avoid reloading them
unnecessarily, which would incur a Hugging Face Hub API call.
If a tokenizer for the given model hasn’t been loaded before, it will
create a new one using AutoTokenizer.from\_pretrained() with the model’s
settings.
**Parameters:**
huggingface\_repo ([HuggingFaceRepo](hf_utils.md#max.pipelines.lib.hf_utils.HuggingFaceRepo)) – The HuggingFaceRepo containing the model.
**Returns:**
The Hugging Face tokenizer for the model.
**Return type:**
PreTrainedTokenizer | PreTrainedTokenizerFast
### `register()` {#max.pipelines.lib.registry.PipelineRegistry.register}
> register(architecture, \*, allow\_override=False)
Adds a new architecture to the registry.
If multiple architectures share the same name but have different tasks,
they are registered in a secondary lookup table keyed by (name, task).
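The two-level lookup can be sketched as follows. `MiniRegistry` is a toy illustration of the behavior described above, not the real `PipelineRegistry`:

```python
# Toy sketch of the registry's two-level lookup: architectures that share a
# name but differ in task go into a secondary table keyed by (name, task).
# MiniRegistry is an illustrative simplification, not the real PipelineRegistry.
class MiniRegistry:
    def __init__(self):
        self._by_name = {}       # name -> (task, arch)
        self._by_name_task = {}  # (name, task) -> arch

    def register(self, name, task, arch, allow_override=False):
        existing = self._by_name.get(name)
        if existing is not None and existing[0] != task:
            # Same name, different task: record both under (name, task).
            self._by_name_task[(name, existing[0])] = existing[1]
            self._by_name_task[(name, task)] = arch
            return
        if existing is not None and not allow_override:
            raise ValueError(f"{name} already registered")
        self._by_name[name] = (task, arch)

    def retrieve(self, name, task=None):
        if task is not None and (name, task) in self._by_name_task:
            return self._by_name_task[(name, task)]
        entry = self._by_name.get(name)
        return entry[1] if entry else None
```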
### `reset()` {#max.pipelines.lib.registry.PipelineRegistry.reset}
> reset()
Clears all registered architectures (mainly for tests).
**Return type:**
None
### `retrieve()` {#max.pipelines.lib.registry.PipelineRegistry.retrieve}
> retrieve(pipeline\_config, task=PipelineTask.TEXT\_GENERATION, override\_architecture=None)
Retrieves the tokenizer and an instantiated pipeline for the config.
### `retrieve_architecture()` {#max.pipelines.lib.registry.PipelineRegistry.retrieve_architecture}
> retrieve\_architecture(huggingface\_repo, use\_legacy\_module=True, task=None)
Retrieve architecture matching the Hugging Face model config.
**Parameters:**
* huggingface\_repo ([HuggingFaceRepo](hf_utils.md#max.pipelines.lib.hf_utils.HuggingFaceRepo)) – The Hugging Face repository to match against.
* use\_legacy\_module ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to use legacy Module architecture (default=True).
When True, appends “\_Legacy” suffix to find legacy graph-based architecture.
When False, uses the standard Hugging Face architecture name for new API.
* task ([PipelineTask](../interfaces.md#max.interfaces.PipelineTask) | None) – Optional task to disambiguate when multiple architectures share the same name.
If not provided and multiple architectures share the same name, the task will
be inferred from the Hugging Face Hub’s pipeline\_tag.
**Returns:**
The matching SupportedArchitecture or None if no match found.
### `retrieve_context_type()` {#max.pipelines.lib.registry.PipelineRegistry.retrieve_context_type}
> retrieve\_context\_type(pipeline\_config, override\_architecture=None, task=None)
Retrieve the context class type associated with the architecture for the given pipeline configuration.
The context type defines how the pipeline manages request state and inputs during
model execution. Different architectures may use different context implementations
that adhere to either the TextGenerationContext or EmbeddingsContext protocol.
**Parameters:**
* pipeline\_config ([PipelineConfig](config.md#max.pipelines.lib.config.PipelineConfig)) – The configuration for the pipeline.
* override\_architecture ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Optional architecture name to use instead of looking up
based on the model repository. This is useful for cases like audio generation
where the pipeline uses a different architecture (e.g., audio decoder) than
the underlying model repository.
* task ([PipelineTask](../interfaces.md#max.interfaces.PipelineTask) | None) – Optional pipeline task to disambiguate when multiple architectures share
the same name but serve different tasks.
**Returns:**
The context class type associated with the architecture, which implements
either the TextGenerationContext or EmbeddingsContext protocol.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If no supported architecture is found for the given model repository
or override architecture name.
### `retrieve_factory()` {#max.pipelines.lib.registry.PipelineRegistry.retrieve_factory}
> retrieve\_factory(pipeline\_config, task=PipelineTask.TEXT\_GENERATION, override\_architecture=None)
Retrieves the tokenizer and a factory that creates the pipeline instance.
### `retrieve_pipeline_task()` {#max.pipelines.lib.registry.PipelineRegistry.retrieve_pipeline_task}
> retrieve\_pipeline\_task(pipeline\_config)
Retrieves the pipeline task for the given pipeline configuration.
**Parameters:**
pipeline\_config ([PipelineConfig](config.md#max.pipelines.lib.config.PipelineConfig)) – The configuration for the pipeline.
**Returns:**
The task associated with the architecture.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If no supported architecture is found for the given model repository.
### `retrieve_tokenizer()` {#max.pipelines.lib.registry.PipelineRegistry.retrieve_tokenizer}
> retrieve\_tokenizer(pipeline\_config, override\_architecture=None, task=None)
Retrieves a tokenizer for the given pipeline configuration.
**Parameters:**
* pipeline\_config ([PipelineConfig](config.md#max.pipelines.lib.config.PipelineConfig)) – Configuration for the pipeline
* override\_architecture ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Optional architecture override string
* task ([PipelineTask](../interfaces.md#max.interfaces.PipelineTask) | None) – Optional pipeline task to disambiguate when multiple
architectures share the same name but serve different tasks.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If no architecture is found.
## `SupportedArchitecture` {#max.pipelines.lib.registry.SupportedArchitecture}
> class max.pipelines.lib.registry.SupportedArchitecture(name, example\_repo\_ids, default\_encoding, supported\_encodings, pipeline\_model, task, tokenizer, default\_weights\_format, context\_type, config, rope\_type=RopeType.none, weight\_adapters=\<factory>, multi\_gpu\_supported=False, required\_arguments=\<factory>, context\_validators=\<factory>, supports\_empty\_batches=False, requires\_max\_batch\_context\_length=False)
Represents a model architecture configuration for MAX pipelines.
Defines the components and settings required to
support a specific model architecture within the MAX pipeline system.
Each SupportedArchitecture instance encapsulates the model implementation,
tokenizer, supported encodings, and other architecture-specific configuration.
New architectures should be registered into the [`PipelineRegistry`](#max.pipelines.lib.registry.PipelineRegistry)
using the [`register()`](#max.pipelines.lib.registry.PipelineRegistry.register) method.
**Example:**
```python
my_architecture = SupportedArchitecture(
name="MyModelForCausalLM", # Must match your Hugging Face model class name
example_repo_ids=[
"your-org/your-model-name", # Add example model repository IDs
],
default_encoding=SupportedEncoding.q4_k,
supported_encodings={
SupportedEncoding.q4_k: [KVCacheStrategy.PAGED],
SupportedEncoding.bfloat16: [KVCacheStrategy.PAGED],
# Add other encodings your model supports
},
pipeline_model=MyModel,
tokenizer=TextTokenizer,
context_type=TextContext,
config=MyModelConfig, # Architecture-specific config class
default_weights_format=WeightsFormat.safetensors,
rope_type=RopeType.none,
weight_adapters={
WeightsFormat.safetensors: weight_adapters.convert_safetensor_state_dict,
# Add other weight formats if needed
},
multi_gpu_supported=True, # Set based on your implementation capabilities
required_arguments={"some_arg": True},
task=PipelineTask.TEXT_GENERATION,
)
```
### `config` {#max.pipelines.lib.registry.SupportedArchitecture.config}
> config: [type](https://docs.python.org/3/library/functions.html#type)\[[ArchConfig](interfaces.md#max.pipelines.lib.interfaces.ArchConfig)]
The architecture-specific configuration class for the model.
This class must implement the `ArchConfig` protocol, providing an
`initialize` method that creates a configuration instance from a
`PipelineConfig`. For models with KV cache, this should be a class
implementing `ArchConfigWithKVCache` to enable KV cache memory estimation.
### `context_type` {#max.pipelines.lib.registry.SupportedArchitecture.context_type}
> context\_type: [type](https://docs.python.org/3/library/functions.html#type)\[[TextGenerationContext](../interfaces.md#max.interfaces.TextGenerationContext)] | [type](https://docs.python.org/3/library/functions.html#type)\[[EmbeddingsContext](../interfaces.md#max.interfaces.EmbeddingsContext)]
The context class type that this architecture uses for managing request state and inputs.
This should be a class (not an instance) that implements either the TextGenerationContext
or EmbeddingsContext protocol, defining how the pipeline processes and tracks requests.
### `context_validators` {#max.pipelines.lib.registry.SupportedArchitecture.context_validators}
> context\_validators: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[Callable](../graph/ops.md#max.graph.ops.Callable)\[\[[TextContext](core.md#max.pipelines.core.TextContext) | [TextAndVisionContext](core.md#max.pipelines.core.TextAndVisionContext)], [None](https://docs.python.org/3/library/constants.html#None)]]
A list of callable validators that verify context inputs before model execution.
These validators are called during context creation to ensure inputs meet
model-specific requirements. Validators should raise InputError for invalid
inputs, providing early error detection before expensive model operations.
```python
def validate_single_image(context: TextContext | TextAndVisionContext) -> None:
if isinstance(context, TextAndVisionContext):
if context.pixel_values and len(context.pixel_values) > 1:
raise InputError(f"Model supports only 1 image, got {len(context.pixel_values)}")
my_architecture = SupportedArchitecture(
# ... other fields ...
context_validators=[validate_single_image],
)
```
### `default_encoding` {#max.pipelines.lib.registry.SupportedArchitecture.default_encoding}
> default\_encoding: SupportedEncoding
The default quantization encoding to use when no specific encoding is requested.
### `default_weights_format` {#max.pipelines.lib.registry.SupportedArchitecture.default_weights_format}
> default\_weights\_format: [WeightsFormat](../graph/weights.md#max.graph.weights.WeightsFormat)
The weights format expected by the pipeline\_model.
### `example_repo_ids` {#max.pipelines.lib.registry.SupportedArchitecture.example_repo_ids}
> example\_repo\_ids: [list](https://docs.python.org/3/library/stdtypes.html#list)\[[str](https://docs.python.org/3/library/stdtypes.html#str)]
A list of Hugging Face repository IDs that use this architecture for testing and validation purposes.
### `multi_gpu_supported` {#max.pipelines.lib.registry.SupportedArchitecture.multi_gpu_supported}
> multi\_gpu\_supported: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether the architecture supports multi-GPU execution.
### `name` {#max.pipelines.lib.registry.SupportedArchitecture.name}
> name: [str](https://docs.python.org/3/library/stdtypes.html#str)
The name of the model architecture that must match the Hugging Face model class name.
### `pipeline_model` {#max.pipelines.lib.registry.SupportedArchitecture.pipeline_model}
> pipeline\_model: [type](https://docs.python.org/3/library/functions.html#type)\[[PipelineModel](interfaces.md#max.pipelines.lib.interfaces.PipelineModel)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]
The PipelineModel class that defines the model graph structure and execution logic.
### `required_arguments` {#max.pipelines.lib.registry.SupportedArchitecture.required_arguments}
> required\_arguments: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [bool](https://docs.python.org/3/library/functions.html#bool) | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float)]
A dictionary specifying required values for PipelineConfig options.
### `requires_max_batch_context_length` {#max.pipelines.lib.registry.SupportedArchitecture.requires_max_batch_context_length}
> requires\_max\_batch\_context\_length: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether the architecture requires a max batch context length to be specified.
If True and max\_batch\_context\_length is not specified, we will default to
the max sequence length of the model.
### `rope_type` {#max.pipelines.lib.registry.SupportedArchitecture.rope_type}
> rope\_type: RopeType = 'none'
The type of RoPE (Rotary Position Embedding) used by the model.
### `supported_encodings` {#max.pipelines.lib.registry.SupportedArchitecture.supported_encodings}
> supported\_encodings: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[SupportedEncoding, [list](https://docs.python.org/3/library/stdtypes.html#list)\[[KVCacheStrategy](../nn/legacy/kv_cache/cache_params.md#max.nn.legacy.kv_cache.cache_params.KVCacheStrategy)]]
A dictionary mapping supported quantization encodings to their compatible KV cache strategies.
### `supports_empty_batches` {#max.pipelines.lib.registry.SupportedArchitecture.supports_empty_batches}
> supports\_empty\_batches: [bool](https://docs.python.org/3/library/functions.html#bool) = False
Whether the architecture can handle empty batches during inference.
When set to True, the pipeline can process requests with zero-sized batches
without errors. This is useful for certain execution modes and expert parallelism.
Most architectures do not require empty batch support and should leave this as False.
### `task` {#max.pipelines.lib.registry.SupportedArchitecture.task}
> task: [PipelineTask](../interfaces.md#max.interfaces.PipelineTask)
The pipeline task type that this architecture supports.
### `tokenizer` {#max.pipelines.lib.registry.SupportedArchitecture.tokenizer}
> tokenizer: [Callable](../graph/ops.md#max.graph.ops.Callable)\[\[...], [PipelineTokenizer](../interfaces.md#max.interfaces.PipelineTokenizer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), [Any](https://docs.python.org/3/library/typing.html#typing.Any), [Any](https://docs.python.org/3/library/typing.html#typing.Any)]]
A callable that returns a PipelineTokenizer instance for preprocessing model inputs.
### `tokenizer_cls` {#max.pipelines.lib.registry.SupportedArchitecture.tokenizer_cls}
> property tokenizer\_cls: [type](https://docs.python.org/3/library/functions.html#type)\[[PipelineTokenizer](../interfaces.md#max.interfaces.PipelineTokenizer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any), [Any](https://docs.python.org/3/library/typing.html#typing.Any), [Any](https://docs.python.org/3/library/typing.html#typing.Any)]]
Returns the tokenizer class for this architecture.
### `weight_adapters` {#max.pipelines.lib.registry.SupportedArchitecture.weight_adapters}
> weight\_adapters: [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[WeightsFormat](../graph/weights.md#max.graph.weights.WeightsFormat), [Callable](../graph/ops.md#max.graph.ops.Callable)\[\[...], [dict](https://docs.python.org/3/library/stdtypes.html#dict)\[[str](https://docs.python.org/3/library/stdtypes.html#str), [WeightData](../graph/weights.md#max.graph.weights.WeightData)]]]
A dictionary of weight format adapters for converting checkpoints from different formats to the default format.
## `get_pipeline_for_task()` {#max.pipelines.lib.registry.get_pipeline_for_task}
> max.pipelines.lib.registry.get\_pipeline\_for\_task(task, pipeline\_config)
## `PIPELINE_REGISTRY` {#max.pipelines.lib.registry.PIPELINE_REGISTRY}
> max.pipelines.lib.registry.PIPELINE\_REGISTRY: [PipelineRegistry](#max.pipelines.lib.registry.PipelineRegistry)
Global registry of supported model architectures.
This is the singleton [`PipelineRegistry`](#max.pipelines.lib.registry.PipelineRegistry) instance you can use to
register new MAX model architectures and query supported models.
---
## sampling (Pipelines)
## `rejection_sampler()` {#max.pipelines.lib.sampling.sampling.rejection_sampler}
> max.pipelines.lib.sampling.sampling.rejection\_sampler(device, \*, seed=0)
## `rejection_sampler_with_residuals()` {#max.pipelines.lib.sampling.sampling.rejection_sampler_with_residuals}
> max.pipelines.lib.sampling.sampling.rejection\_sampler\_with\_residuals(device, \*, seed=0, debug=False)
Builds a rejection sampler with residual sampling for speculative decoding.
Computes acceptance ratios for draft tokens, finds first rejection,
samples from residual distribution (target - draft), and generates bonus
tokens.
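The acceptance-and-residual scheme described above can be illustrated with a single-sequence NumPy toy. This is not the compiled MAX sampler; `rejection_sample` is a hypothetical name, and `target_probs` carries one extra row for the bonus-token position:

```python
# Toy single-sequence rejection sampler with residual sampling, illustrating
# the algorithm described above (not the real MAX implementation).
import numpy as np

def rejection_sample(draft_probs, target_probs, draft_tokens, rng):
    """draft_probs: [num_draft, vocab]; target_probs: [num_draft + 1, vocab]
    (the extra row is the bonus-token distribution)."""
    num_draft = len(draft_tokens)
    out = []
    for i in range(num_draft):
        tok = draft_tokens[i]
        # Accept draft token with probability min(1, p_target / p_draft).
        ratio = target_probs[i, tok] / max(draft_probs[i, tok], 1e-20)
        if rng.random() < min(1.0, ratio):
            out.append(int(tok))
        else:
            # First rejection: sample from the residual max(target - draft, 0).
            residual = np.maximum(target_probs[i] - draft_probs[i], 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(residual.size, p=residual)))
            return out  # stop at the first rejection
    # All drafts accepted: draw a bonus token from the target's extra row.
    out.append(int(rng.choice(target_probs.shape[1], p=target_probs[num_draft])))
    return out
```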
## `PreTrainedPipelineTokenizer` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer}
### `apply_chat_template()` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer.apply_chat_template}
> apply\_chat\_template(messages)
Applies the delegate’s chat template to the messages.
### `decode()` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer.decode}
> async decode(encoded, \*\*kwargs)
Decodes token ids to text via the delegate.
### `encode()` {#max.pipelines.lib.tokenizer.PreTrainedPipelineTokenizer.encode}
> async encode(prompt, add\_special\_tokens=False)
Encodes the prompt to token ids via the delegate.
## `TextAndVisionTokenizer` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer}
### `apply_chat_template()` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.apply_chat_template}
> apply\_chat\_template(messages)
Applies the processor’s chat template to the messages.
### `encode()` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.encode}
> async encode(prompt, add\_special\_tokens=True)
Transforms the provided prompt into a token array.
### `eos` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.eos}
> property eos: [int](https://docs.python.org/3/library/functions.html#int)
Returns the end-of-sequence token ID from the delegate.
### `expects_content_wrapping` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.expects_content_wrapping}
> property expects\_content\_wrapping: [bool](https://docs.python.org/3/library/functions.html#bool)
Returns whether this tokenizer expects content wrapping.
### `new_context()` {#max.pipelines.lib.tokenizer.TextAndVisionTokenizer.new_context}
> async new\_context(request)
Creates a new TextAndVisionContext object using the necessary information from the TextGenerationRequest.
## `TextTokenizer` {#max.pipelines.lib.tokenizer.TextTokenizer}
> class max.pipelines.lib.tokenizer.TextTokenizer(model\_path, pipeline\_config, \*, revision=None, max\_length=None, trust\_remote\_code=False, enable\_llama\_whitespace\_fix=False, chat\_template=None, context\_validators=None, \*\*unused\_kwargs)
Encapsulates creation of TextContext and specific token encode/decode logic.
**Parameters:**
* model\_path ([str](https://docs.python.org/3/library/stdtypes.html#str)) – Path to the model/tokenizer
* revision ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Git revision/branch to use
* max\_length ([int](https://docs.python.org/3/library/functions.html#int) | None) – Maximum sequence length
* trust\_remote\_code ([bool](https://docs.python.org/3/library/functions.html#bool)) – Whether to trust remote code from the model
* enable\_llama\_whitespace\_fix ([bool](https://docs.python.org/3/library/functions.html#bool)) – Enable whitespace fix for Llama tokenizers
* pipeline\_config ([PipelineConfig](config.md#max.pipelines.lib.config.PipelineConfig)) – Optional pipeline configuration
* chat\_template ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Optional custom chat template string to override the one
shipped with the Hugging Face model config. This allows
customizing the prompt formatting for different use cases.
* context\_validators ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[Callable](../graph/ops.md#max.graph.ops.Callable)\[\[[TextContext](core.md#max.pipelines.core.TextContext)], None]] | None)
### `apply_chat_template()` {#max.pipelines.lib.tokenizer.TextTokenizer.apply_chat_template}
> apply\_chat\_template(messages, tools, chat\_template\_options=None)
Applies the delegate chat template to messages (and optional tools).
### `encode()` {#max.pipelines.lib.tokenizer.TextTokenizer.encode}
> async encode(prompt, add\_special\_tokens=True)
Transforms the provided prompt into a token array.
### `eos` {#max.pipelines.lib.tokenizer.TextTokenizer.eos}
> property eos: [int](https://docs.python.org/3/library/functions.html#int)
Returns the end-of-sequence token ID from the delegate.
### `expects_content_wrapping` {#max.pipelines.lib.tokenizer.TextTokenizer.expects_content_wrapping}
> property expects\_content\_wrapping: [bool](https://docs.python.org/3/library/functions.html#bool)
Returns whether this tokenizer expects content wrapping.
### `new_context()` {#max.pipelines.lib.tokenizer.TextTokenizer.new_context}
> async new\_context(request)
Creates a new TextContext object using the necessary information from the TextGenerationRequest.
## `max_tokens_to_generate()` {#max.pipelines.lib.tokenizer.max_tokens_to_generate}
> max.pipelines.lib.tokenizer.max\_tokens\_to\_generate(prompt\_size, max\_length, max\_new\_tokens=None)
Returns the max number of new tokens to generate.
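One plausible reading of this helper's contract is "cap the requested new tokens by the room left in the context window." The sketch below is an assumption about the logic, not the actual implementation:

```python
def max_tokens_to_generate(prompt_size, max_length, max_new_tokens=None):
    """Cap the number of new tokens by the room left in the context window."""
    if max_length is None:
        return max_new_tokens
    # Room remaining after the prompt occupies part of the window.
    room = max(max_length - prompt_size, 0)
    if max_new_tokens is None:
        return room
    return min(max_new_tokens, room)
```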
---
## profiler
Performance profiling and tracing utilities for MAX.
This module provides tools for profiling and tracing MAX operations to analyze
performance characteristics. Profiling captures timing information for code
execution, which helps identify bottlenecks and optimize your models.
To enable profiling, set the `MODULAR_ENABLE_PROFILING=1` environment
variable before running your code. Without this variable, profiling calls will
be no-ops with minimal overhead.
The profiler supports three usage patterns:
1. **Context manager**: Use [`Tracer`](#max.profiler.Tracer) as a context manager to profile a
code block.
2. **Decorator**: Use [`@traced`](#max.profiler.traced) to profile entire functions.
3. **Manual stack**: Use [`Tracer`](#max.profiler.Tracer) methods to explicitly control profiling
spans.
## `Tracer` {#max.profiler.Tracer}
> class max.profiler.Tracer(message=None, color='modular\_purple')
A stack-based profiling manager for creating nested profiling spans.
Manages a stack of profiling spans that allows for nested tracing without
requiring deeply nested `with Tracer(name):` statements. This is especially
useful when you need to dynamically create and manage profiling spans based
on runtime conditions or when profiling spans don’t align with your code’s
block structure.
The `Tracer` can be used both as a context manager and as a manual stack
manager. As a context manager, it ensures all pushed spans are properly
closed when the context exits.
```python
from max.profiler import Tracer

# Manual stack management
tracer = Tracer("parent_operation", color="modular_purple")
tracer.push("child_operation")
# ... perform work ...
tracer.pop()

# Context manager with manual stack
with Tracer("parent_operation", color="modular_purple") as tracer:
    # The parent span is named "parent_operation"
    tracer.push("child_operation")
    # ... perform work ...
    tracer.pop()
# All spans are automatically closed on context exit
```
**Parameters:**
* message ([str](https://docs.python.org/3/library/stdtypes.html#str) | None)
* color ([str](https://docs.python.org/3/library/stdtypes.html#str))
### `cleanup()` {#max.profiler.Tracer.cleanup}
> cleanup()
Closes all remaining profiling spans.
Pops and closes all profiling spans that were pushed onto the stack.
This method is automatically called when the tracer is used as a
context manager or when the object is deleted.
**Return type:**
None
### `mark()` {#max.profiler.Tracer.mark}
> mark()
Marks the current profiling span with a timestamp.
Records a timestamp event within the current profiling span. This is
useful for marking significant events or milestones within a longer
operation.
**Raises:**
[AssertionError](https://docs.python.org/3/library/exceptions.html#AssertionError) – If the stack is empty when mark is called.
**Return type:**
None
### `next()` {#max.profiler.Tracer.next}
> next(message, color='modular\_purple')
Transitions to the next profiling span.
Pops the current profiling span and immediately pushes a new one with
the specified message. This is a convenience method for sequential
operations at the same nesting level.
**Parameters:**
* message ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The name of the new profiling span.
* color ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The color of the profiling span for visualization tools.
**Return type:**
None
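The push/next/pop semantics amount to a simple span stack, where `next()` is a pop immediately followed by a push. The class below is an illustrative pure-Python model of that behavior, not the MAX profiler internals:

```python
class ToyTracer:
    """Minimal model of Tracer's span stack: next() == pop() then push()."""
    def __init__(self):
        self.stack = []   # currently open spans
        self.closed = []  # spans closed, in completion order

    def push(self, message):
        self.stack.append(message)

    def pop(self):
        self.closed.append(self.stack.pop())

    def next(self, message):
        self.pop()
        self.push(message)

    def cleanup(self):
        while self.stack:
            self.pop()

t = ToyTracer()
t.push("load")
t.next("preprocess")  # closes "load", opens "preprocess"
t.next("inference")   # closes "preprocess", opens "inference"
t.cleanup()           # closes everything still open
```

This is why `next()` is convenient for sequential pipeline phases: each call ends the previous phase's span and starts the next one at the same nesting level.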
### `pop()` {#max.profiler.Tracer.pop}
> pop(exc\_type=None, exc\_value=None, traceback=None)
Pops a profiling span off the stack and closes it.
Removes the most recently pushed profiling span from the stack and
closes it, recording its execution time. Exception information can be
passed through for proper error handling in context managers.
**Parameters:**
* exc\_type ([type](https://docs.python.org/3/library/functions.html#type)\[[BaseException](https://docs.python.org/3/library/exceptions.html#BaseException)] | None) – The exception type if an exception occurred, or None.
* exc\_value ([BaseException](https://docs.python.org/3/library/exceptions.html#BaseException) | None) – The exception instance if an exception occurred, or None.
* traceback ([TracebackType](https://docs.python.org/3/library/types.html#types.TracebackType) | None) – The traceback object if an exception occurred, or None.
**Return type:**
None
### `push()` {#max.profiler.Tracer.push}
> push(message=None, color='modular\_purple')
Pushes a new profiling span onto the stack.
Creates and activates a new profiling span. If profiling is disabled or
no message is provided, pushes a None placeholder to maintain stack
consistency.
**Parameters:**
* message ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – The name of the profiling span. If None, no span is created.
* color ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The color of the profiling span for visualization tools.
**Return type:**
None
## `traced()` {#max.profiler.traced}
> max.profiler.traced(func=None, \*, message=None, color='modular\_purple')
Decorator for creating a profiling span for a function.
Creates a profiling span that measures the execution time of the decorated
function. This is useful for identifying performance bottlenecks without
modifying the function’s internal code. The decorator supports both
synchronous and asynchronous functions.
```python
from max.profiler import traced

# Decorator with custom span name
@traced(message="inference", color="red")
def run_model() -> None:
    # The profiling span is named "inference"
    model.execute()

# Decorator with default span name (uses function name)
@traced
def preprocess_data() -> None:
    # The profiling span is named "preprocess_data"
    data.normalize()
```
**Parameters:**
* func (\_FuncType | None) – The function to profile.
* message ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – The name of the profiling span. If None, uses the function name.
* color ([str](https://docs.python.org/3/library/stdtypes.html#str)) – The color of the profiling span for visualization tools.
**Returns:**
The decorated function wrapped in a trace object.
**Return type:**
[Callable](graph/ops.md#max.graph.ops.Callable)
---
## random
Provides random tensor generation utilities.
This module provides functions for generating random tensors with various
distributions. All functions support specifying data type and device,
with sensible defaults based on the target device.
You can generate random tensors using different distributions:
```default
from max import random
from max.dtype import DType
from max.driver import CPU
tensor1 = random.uniform((2, 3), dtype=DType.float32, device=CPU())
tensor2 = random.uniform((4, 4), range=(0, 1), dtype=DType.float32, device=CPU())
```
## `gaussian()` {#max.random.gaussian}
> max.random.gaussian(shape=(), mean=0.0, std=1.0, \*, dtype=None, device=None)
Creates a tensor filled with random values from a Gaussian (normal) distribution.
Generates a tensor with values sampled from a normal (Gaussian) distribution
with the specified mean and standard deviation. This is commonly used for
weight initialization using techniques like Xavier/Glorot or He initialization.
Create tensors with random values from a Gaussian distribution:
```default
from max import random
from max.driver import CPU
from max.dtype import DType
# Standard normal distribution
tensor = random.gaussian((2, 3), dtype=DType.float32, device=CPU())
```
**Parameters:**
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](graph/dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The shape of the output tensor. Defaults to scalar (empty tuple).
* mean ([float](https://docs.python.org/3/library/functions.html#float)) – The mean (center) of the Gaussian distribution. This determines
where the distribution is centered. Defaults to `0.0`.
* std ([float](https://docs.python.org/3/library/functions.html#float)) – The standard deviation (spread) of the Gaussian distribution.
Must be positive. Larger values create more spread in the distribution.
Defaults to `1.0`.
* dtype ([DType](dtype.md#max.dtype.DType) | None) – The data type of the output tensor. If `None`, uses the
default dtype for the specified device (float32 for CPU,
bfloat16 for accelerators). Defaults to `None`.
* device ([Device](driver.md#max.driver.Device) | None) – The device where the tensor will be allocated. If `None`,
uses the default device (accelerator if available, otherwise CPU).
Defaults to `None`.
**Returns:**
A [`Tensor`](tensor.md#max.tensor.Tensor) with random values sampled from
the Gaussian distribution.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If std <= 0.
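For reference, the Xavier/Glorot and He schemes mentioned above choose `std` from the layer's fan-in and fan-out. The formulas below come from the standard initialization literature and are not part of this API; the result would be passed as the `std=` argument to `gaussian()`:

```python
import math

def xavier_std(fan_in, fan_out):
    # Glorot & Bengio (2010): variance 2 / (fan_in + fan_out)
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    # He et al. (2015), suited to ReLU networks: variance 2 / fan_in
    return math.sqrt(2.0 / fan_in)
```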
## `normal()` {#max.random.normal}
> max.random.normal(shape=(), mean=0.0, std=1.0, \*, dtype=None, device=None)
Alias for [`gaussian()`](#max.random.gaussian).
Creates a tensor with values from a normal (Gaussian) distribution.
## `seed()` {#max.random.seed}
> max.random.seed()
Gets the global random seed tensor.
Returns the global seed tensor used for random number generation in eager
execution mode. Creates the seed tensor on first access, initialized with
the dtype, shape, and device specified by `ops.random.SeedType`.
**Returns:**
The global seed tensor for random number generation.
**Return type:**
[Tensor](tensor.md#max.tensor.Tensor)
## `set_seed()` {#max.random.set_seed}
> max.random.set\_seed(value)
Sets the global random seed value.
Updates the global random seed to the specified value. This affects all
subsequent random number generation in eager execution mode.
**Parameters:**
value ([int](https://docs.python.org/3/library/functions.html#int)) – The integer seed value to set.
**Return type:**
None
## `uniform()` {#max.random.uniform}
> max.random.uniform(shape=(), range=(0, 1), \*, dtype=None, device=None)
Creates a tensor filled with random values from a uniform distribution.
Generates a tensor with values uniformly distributed between the specified
minimum and maximum bounds. This is useful for initializing weights,
generating random inputs, or creating noise.
Create tensors with uniform random values:
```default
from max import random
from max.dtype import DType
from max.driver import CPU
# Generate 2x3 tensor with values between 0 and 1
tensor1 = random.uniform((2, 3), dtype=DType.float32, device=CPU())
tensor2 = random.uniform((4, 4), range=(0, 1), dtype=DType.float32, device=CPU())
```
**Parameters:**
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](graph/dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The shape of the output tensor. Defaults to scalar (empty tuple).
* range ([tuple](https://docs.python.org/3/library/stdtypes.html#tuple)\[[float](https://docs.python.org/3/library/functions.html#float), [float](https://docs.python.org/3/library/functions.html#float)]) – A tuple specifying the (min, max) bounds of the uniform
distribution. The minimum value is inclusive, the maximum value
is exclusive. Defaults to `(0, 1)`.
* dtype ([DType](dtype.md#max.dtype.DType) | None) – The data type of the output tensor. If `None`, uses the
default dtype for the specified device (float32 for CPU,
bfloat16 for accelerators). Defaults to `None`.
* device ([Device](driver.md#max.driver.Device) | None) – The device where the tensor will be allocated. If `None`,
uses the default device (accelerator if available, otherwise CPU).
Defaults to `None`.
**Returns:**
A [`Tensor`](tensor.md#max.tensor.Tensor) with random values sampled from
the uniform distribution.
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the range tuple does not contain exactly two values
or if min >= max.
---
## tensor (Python)
Provides tensor operations with eager execution capabilities.
This module provides the [`Tensor`](#max.tensor.Tensor) class which supports
eager execution of tensor operations, complementing the graph-based execution
model provided by `graph`. The tensor operations automatically compile
and execute using the MAX runtime.
**Key Features:**
* **Eager semantics**: Operations give immediate results for quick iteration and feedback.
* **High performance**: All operations use high-performance Mojo implementations
compiled specifically for the available hardware.
* **Automatic compilation**: Tensors are compiled and optimized automatically.
Operations may be easily fused into larger graphs to take advantage of
the graph compiler’s automatic fusions.
* **Lazy evaluation**: Tensors may be computed lazily until their values are needed.
* **Familiar API**: Supports common array operations and indexing.
:::note Note
Tensors use lazy evaluation and JIT compilation, which incurs compilation
overhead on first execution. This can result in higher latency for initial
operations compared to eager frameworks like NumPy or PyTorch. Subsequent
executions reuse compiled kernels for better performance.
:::
Create and manipulate tensors with automatic compilation and optimization:
```python
from max.tensor import Tensor
from max.driver import CPU
from max.dtype import DType
x = Tensor.ones((2, 3), dtype=DType.float32, device=CPU())
y = Tensor.zeros_like(x)
result = x + y # Eager execution with automatic compilation
```
Operations may be combined into a single execution graph to take advantage
of automatic kernel fusion:
```python
from max import functional as F
from max.tensor import Tensor

@F.functional
def linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor:
    return x @ weight.T + bias

# Create and operate on tensors
x = Tensor.ones([2, 3])
weight = Tensor.ones([6, 3])
bias = Tensor.ones([6])

# Eager execution with a single fused graph
result = linear(x, weight, bias)
```
Users may opt in to lazy execution. This is primarily useful for
1. Operations which may never execute, for instance creating modules
with randomly initialized weights before loading weights
2. Combining many operations into a single execution
```python
from max import functional as F
from max.driver import CPU
from max.dtype import DType
from max.nn import Linear
from max.tensor import Tensor

with F.lazy():
    model = Linear(2, 3)
    print(model)  # Lazy weights not initialized

# Load pretrained weights
weights = {
    "weight": Tensor.zeros([3, 2]),
    "bias": Tensor.zeros([3]),
}
model.load_state_dict(weights)

# Or compile directly without ever initializing weights
from max.graph import TensorType

input_type = TensorType(DType.float32, ["batch", 2], CPU())
model = model.compile(input_type, weights=weights)
```
## `RealizationContext` {#max.tensor.RealizationContext}
> class max.tensor.RealizationContext(\*args, \*\*kwargs)
Implements a way to realize unrealized tensors.
Most users should never have to think about the existence of this type.
It exists to facilitate optimizations around where and when tensor
operations are executed.
* Each tensor is either real or associated with a RealizationContext.
* If a tensor is not real, i.e. “unrealized”, then it is backed by some
symbolic computation.
* The RealizationContext is responsible for tracking this symbolic
computation and “realizing” the tensor (executing the computation and
backing the tensor with real data) if and when it is asked to do so.
* A RealizationContext can only realize tensors associated with it.
RealizationContext abstracts over various semantics of tensor construction.
**“Eager” execution**: tensors are realized as soon as the realization context
exits. This is the default behavior.
This has a concrete advantage over eagerly executing one operation
at a time: by controlling where the eager context starts and ends, we
give advanced users a tool to set fine-grained bounds for automatic
fusion.
In practice the easiest way to do this is to mark a function as
F.functional. This function is then assumed to be “atomic” for the
purposes of eager execution. All ops within the function execute as
part of the same graph, meaning the compiler is free to fuse operations
and generate fused kernels within this region.
**“Lazy” execution**: tensors are realized only when code later tries to use
them.
This enables a class of interface design common in the ML world, in
which layers are constructed with randomized weights which are never
used. Lazy execution neatly allows constructing entire models,
only performing the weight initialization and allocating memory for
them if and when those weights are actually used.
**Graph compilation**: tensors may never be realized.
This allows tensor operations to be composed with direct usage of
the Graph API, for instance Module.compile, or using F.\* operations
in another Graph API usage.
**Async execution**: Tensors are realized as async functions,
allowing clean integration in async systems like web services.
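The eager/lazy distinction above can be modeled with thunks: an unrealized value holds a deferred computation, and realizing it runs that computation once and caches the result. The class below is a toy model of that idea, not MAX's graph machinery:

```python
class ToyUnrealized:
    """An unrealized value backed by a deferred computation."""
    def __init__(self, compute):
        self._compute = compute
        self._value = None
        self.realized = False

    def realize(self):
        # Execute the backing computation exactly once, then cache.
        if not self.realized:
            self._value = self._compute()
            self.realized = True
        return self._value

calls = []
t = ToyUnrealized(lambda: calls.append("ran") or 42)
before = t.realized   # nothing has executed yet
value = t.realize()   # the computation runs here
again = t.realize()   # cached; not recomputed
```

Under lazy execution, model weights behave like `t` before `realize()`: they cost nothing until something actually reads their values.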
### `add_source()` {#max.tensor.RealizationContext.add_source}
> add\_source(tensor)
Adds a realized tensor as a “source” of the realization state,
i.e. one on whose values unrealized tensors depend.
**Parameters:**
tensor ([Tensor](#max.tensor.Tensor)) – The realized tensor to add as a source to the computation.
**Returns:**
A realization state for the tensor. This may be used to compute
downstream unrealized values. If it is used in any mutating
operations, it should be assigned to `tensor.state` to mark
the tensor as having been mutated.
**Return type:**
[RealizationState](#max.tensor.RealizationState)
### `create_unrealized()` {#max.tensor.RealizationContext.create_unrealized}
> create\_unrealized(value)
Registers an unrealized graph value with the realization context
and returns it as an unrealized tensor.
**Parameters:**
value ([BufferValue](graph/BufferValue.md#max.graph.BufferValue) | [TensorValue](graph/TensorValue.md#max.graph.TensorValue)) – The graph value representing the result of a computation.
**Returns:**
A new tensor associated with the unrealized value.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `graph` {#max.tensor.RealizationContext.graph}
> graph: [Graph](graph/Graph.md#max.graph.Graph)
The graph used by the realization context.
### `realize_all()` {#max.tensor.RealizationContext.realize_all}
> async realize\_all()
Realizes all unrealized tensors associated with this context.
## `RealizationState` {#max.tensor.RealizationState}
> class max.tensor.RealizationState(value, ctx)
State for an unrealized tensor.
See [`RealizationContext`](#max.tensor.RealizationContext).
**Parameters:**
* value ([BufferValue](graph/BufferValue.md#max.graph.BufferValue) | [TensorValue](graph/TensorValue.md#max.graph.TensorValue))
* ctx ([RealizationContext](#max.tensor.RealizationContext))
### `ctx` {#max.tensor.RealizationState.ctx}
> ctx: [RealizationContext](#max.tensor.RealizationContext)
The realization context used to create this tensor. This context
is responsible for realizing the tensor to a real value.
### `value` {#max.tensor.RealizationState.value}
> value: [BufferValue](graph/BufferValue.md#max.graph.BufferValue) | [TensorValue](graph/TensorValue.md#max.graph.TensorValue)
The symbolic value representing the computation backing this tensor.
## `Tensor` {#max.tensor.Tensor}
> class max.tensor.Tensor(\*, storage=None, state=None)
A multi-dimensional array with eager execution and automatic compilation.
The Tensor class provides a high-level interface for numerical computations
with automatic compilation and optimization via the MAX runtime. Operations
on tensors execute eagerly while benefiting from lazy evaluation and
graph-based optimizations behind the scenes.
**Key Features:**
* **Eager execution**: Operations execute immediately with automatic compilation.
* **Lazy evaluation**: Computation may be deferred until results are needed.
* **High performance**: Uses the Mojo compiler and optimized kernels.
* **Familiar API**: Supports common array operations and indexing.
* **Device flexibility**: Works seamlessly across CPU and accelerators.
**Creating Tensors:**
Create tensors using factory methods like [`ones()`](#max.tensor.Tensor.ones), [`zeros()`](#max.tensor.Tensor.zeros),
[`constant()`](#max.tensor.Tensor.constant), [`arange()`](#max.tensor.Tensor.arange), or from other array libraries via
[`from_dlpack()`](#max.tensor.Tensor.from_dlpack).
```python
from max import tensor
from max.dtype import DType
# Create tensors with factory methods
x = tensor.Tensor.ones((2, 3), dtype=DType.float32)
y = tensor.Tensor.zeros((2, 3), dtype=DType.float32)
# Perform operations
result = x + y # Eager execution with automatic compilation
# Access values
print(result.shape) # (2, 3)
print(result.dtype) # DType.float32
```
**Implementation Notes:**
Tensors use lazy evaluation internally - they don’t always hold concrete
data in memory. A tensor may be “unrealized” (not yet computed) until its
value is actually needed (e.g., when converting to other formats or calling
[`item()`](#max.tensor.Tensor.item)). This allows the runtime to optimize sequences of
operations efficiently.
Operations on tensors build a computation graph behind the scenes, which is
compiled and executed when needed. All illegal operations fail immediately
with clear error messages, ensuring a smooth development experience.
:::note Note
The lazy evaluation model and JIT compilation introduce compilation overhead
on first execution of operations. This results in higher latency for
interactive operations compared to eager frameworks like NumPy or PyTorch,
particularly when materializing tensor values (e.g., printing or converting
to other formats). Subsequent operations on similar shapes and dtypes reuse
compiled kernels for improved performance.
:::
**Interoperability:**
Tensors support the DLPack protocol for zero-copy data exchange with NumPy,
PyTorch, JAX, and other array libraries. Use [`from_dlpack()`](#max.tensor.Tensor.from_dlpack) to import
arrays and standard DLPack conversion for export.
**Parameters:**
* storage ([Buffer](driver.md#max.driver.Buffer) | None)
* state ([RealizationState](#max.tensor.RealizationState) | None)
### `T` {#max.tensor.Tensor.T}
> property T: [Tensor](#max.tensor.Tensor)
Returns a tensor with the last two dimensions transposed.
This is equivalent to calling `transpose(-1, -2)`, which swaps
the last two dimensions of the tensor. For a 2D matrix, this produces
the standard matrix transpose.
```python
from max.tensor import Tensor
from max.dtype import DType
# Create a 2x3 matrix
x = Tensor.constant([[1, 2, 3], [4, 5, 6]], dtype=DType.int32)
print(f"Original shape: {x.shape}")
# Output: Original shape: [Dim(2), Dim(3)]
# Use .T property (equivalent to transpose(-1, -2))
y = x.T
print(f"Transposed shape: {y.shape}")
# Output: Transposed shape: [Dim(3), Dim(2)]
print(y)
```
**Returns:**
A tensor with the last two dimensions transposed.
### `arange()` {#max.tensor.Tensor.arange}
> classmethod arange(start=0, stop=None, step=1, \*, dtype=None, device=None)
Creates a tensor with evenly spaced values within a given interval.
Returns a new 1D tensor containing a sequence of values starting from
`start` (inclusive) and ending before `stop` (exclusive), with values
spaced by `step`. This is similar to Python’s built-in `range()`
function and NumPy’s `arange()`.
```python
from max import tensor
from max.dtype import DType
# Create a range from 0 to 10 (exclusive)
x = tensor.Tensor.arange(10)
# Result: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Create a range from 5 to 15 with step 2
y = tensor.Tensor.arange(5, 15, 2)
# Result: [5, 7, 9, 11, 13]
# Use a specific dtype
z = tensor.Tensor.arange(0, 5, dtype=DType.float32)
# Result: [0.0, 1.0, 2.0, 3.0, 4.0]
# Create a range with float step (like numpy/pytorch)
w = tensor.Tensor.arange(0.0, 1.0, 0.2, dtype=DType.float32)
# Result: [0.0, 0.2, 0.4, 0.6, 0.8]
# Create a descending range with negative step
v = tensor.Tensor.arange(5, 0, -1, dtype=DType.float32)
# Result: [5.0, 4.0, 3.0, 2.0, 1.0]
```
**Parameters:**
* start (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray)) – The starting value of the sequence. If `stop` is not provided,
this becomes the `stop` value and `start` defaults to 0.
* stop (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray) | None) – The end value of the sequence (exclusive). If not specified,
the sequence ends at `start` and begins at 0.
* step (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray)) – The spacing between values in the sequence. Must be non-zero.
* dtype ([DType](dtype.md#max.dtype.DType) | None) – The data type for the tensor elements. If not specified,
defaults to `DType.float32` for CPU devices and
`DType.bfloat16` for accelerator devices.
* device ([Device](driver.md#max.driver.Device) | None) – The device where the tensor will be allocated. If not
specified, defaults to an accelerator if available, otherwise CPU.
**Returns:**
A 1D tensor containing the evenly spaced values.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `argmax()` {#max.tensor.Tensor.argmax}
> argmax(axis=-1)
Finds the indices of the maximum values along an axis.
Returns a tensor containing the indices of the maximum values along
the specified axis. This is useful for finding the position of the
largest element, such as determining predicted classes in classification.
```python
from max import tensor
from max.dtype import DType
# Create a 2x4 tensor
x = tensor.Tensor.constant(
[[1.2, 3.5, 2.1, 0.8], [2.3, 1.9, 4.2, 3.1]], dtype=DType.float32
)
# Find argmax along last axis (within each row)
indices = x.argmax(axis=-1)
# Result: [1, 2] (index 1 in first row, index 2 in second row)
# Find argmax over all elements
index = x.argmax(axis=None)
# Result: 6 (flattened index of maximum value 4.2)
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to find the maximum indices. Defaults
to -1 (the last axis). If None, finds the index of the maximum
value across all elements.
**Returns:**
A tensor containing the indices of the maximum values.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `broadcast_to()` {#max.tensor.Tensor.broadcast_to}
> broadcast\_to(shape)
Broadcasts the tensor to the specified shape.
Returns a tensor broadcast to the target shape, following NumPy
broadcasting semantics. Dimensions of size 1 in the input can be
expanded to match larger dimensions in the target shape.
This is equivalent to PyTorch’s `torch.broadcast_to()` and
`torch.Tensor.expand()`.
```python
from max import tensor
from max.dtype import DType
# Create a tensor with shape (3, 1)
x = tensor.Tensor.ones([3, 1], dtype=DType.float32)
# Broadcast to (3, 4) - expands the second dimension
y = x.broadcast_to([3, 4])
print(y.shape) # (3, 4)
# Add a new leading dimension
w = x.broadcast_to([2, 3, 1])
print(w.shape) # (2, 3, 1)
```
**Parameters:**
shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](graph/dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The target shape. Each dimension must either match the input
dimension or be broadcastable from size 1.
**Returns:**
A tensor broadcast to the specified shape.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `cast()` {#max.tensor.Tensor.cast}
> cast(dtype)
Casts the tensor to a different data type.
Returns a new tensor with the same values but a different data type.
This is useful for type conversions between different numeric types,
such as converting `float32` to `int32` for indexing operations or
`float32` to `bfloat16` for memory-efficient computations.
```python
from max import tensor
from max.dtype import DType
# Create a float32 tensor
x = tensor.Tensor.constant([1.7, 2.3, 3.9], dtype=DType.float32)
print(x.dtype) # DType.float32
# Cast to int32 (truncates decimal values)
y = x.cast(DType.int32)
print(y.dtype) # DType.int32
# Values: [1, 2, 3]
```
**Parameters:**
dtype ([DType](dtype.md#max.dtype.DType)) – The target data type for the tensor.
**Returns:**
A new tensor with the specified data type.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `clip()` {#max.tensor.Tensor.clip}
> clip(\*, min=None, max=None)
Clips values outside a range to the boundaries of the range.
```python
from max import tensor
# Create a 2x4 tensor
x = tensor.Tensor.constant(
[[1.2, 3.5, 2.1, 0.8], [2.3, 1.9, 4.2, 3.1]]
)
# Clip values above 3.0
clipped_above = x.clip(max=3.)
# Result: [[1.2, 3.0, 2.1, 0.8], [2.3, 1.9, 3.0, 3.0]]
# Clip values below 3.0
clipped_below = x.clip(min=3.)
# Result: [[3.0, 3.5, 3.0, 3.0], [3.0, 3.0, 4.2, 3.1]]
```
**Parameters:**
* min (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray) | None) – The minimum value of the range. If not specified, values
are not clipped from below.
* max (Value\[TensorType] | [TensorValue](graph/TensorValue.md#max.graph.TensorValue) | [Shape](graph/shape.md#max.graph.shape.Shape) | [Dim](graph/dim.md#max.graph.dim.Dim) | HasTensorValue | [int](https://docs.python.org/3/library/functions.html#int) | [float](https://docs.python.org/3/library/functions.html#float) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [floating](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.floating)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [DLPackArray](driver.md#max.driver.DLPackArray) | None) – The maximum value of the range. If not specified, values
are not clipped from above.
**Returns:**
A tensor containing the values clipped to the specified range.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `constant()` {#max.tensor.Tensor.constant}
> classmethod constant(value, \*, dtype=None, device=None)
Creates a constant tensor from a scalar, array, or nested list.
Constructs a tensor with constant values that can be a scalar, a nested
Python list, or a DLPack-compatible array. The shape is automatically
inferred from the input data structure.
```python
from max import tensor
from max.dtype import DType
# Create from scalar
x = tensor.Tensor.constant(42, dtype=DType.int32)
# Create from nested list
y = tensor.Tensor.constant([[1.0, 2.0], [3.0, 4.0]])
# Create from NumPy array
import numpy as np
z = tensor.Tensor.constant(np.array([1, 2, 3]))
```
**Parameters:**
* value ([DLPackArray](driver.md#max.driver.DLPackArray) | [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[float](https://docs.python.org/3/library/functions.html#float) | [number](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.number)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | [Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[Number | NestedArray]] | [float](https://docs.python.org/3/library/functions.html#float) | [number](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.number)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – The constant value for the tensor. Can be a scalar number,
a nested Python list, or any DLPack-compatible array.
* dtype ([DType](dtype.md#max.dtype.DType) | None) – The data type for the tensor elements. If not specified,
defaults to `DType.float32` for CPU devices and
`DType.bfloat16` for accelerator devices.
* device ([Device](driver.md#max.driver.Device) | None) – The device where the tensor will be allocated. If not
specified, defaults to an accelerator if available, otherwise CPU.
**Returns:**
A new tensor containing the constant value(s).
**Return type:**
[Tensor](#max.tensor.Tensor)
### `device` {#max.tensor.Tensor.device}
> property device: [Device](driver.md#max.driver.Device)
Gets the device where the tensor is stored.
Returns the device (CPU or accelerator) where the tensor’s data is
located.
**Returns:**
The device where the tensor is stored.
**Return type:**
[Device](driver.md#max.driver.Device)
### `driver_tensor` {#max.tensor.Tensor.driver_tensor}
> property driver\_tensor: [Buffer](driver.md#max.driver.Buffer)
A pointer to the underlying memory.
Raises an error if the tensor is unrealized.
### `dtype` {#max.tensor.Tensor.dtype}
> property dtype: [DType](dtype.md#max.dtype.DType)
Gets the data type of the tensor elements.
Returns the data type (dtype) of the elements stored in the tensor,
such as `float32`, `int32`, or `bfloat16`.
**Returns:**
The data type of the tensor elements.
**Return type:**
[DType](dtype.md#max.dtype.DType)
### `from_dlpack()` {#max.tensor.Tensor.from_dlpack}
> classmethod from\_dlpack(array)
Creates a tensor from a DLPack array.
Constructs a tensor by importing data from any object that supports
the DLPack protocol (such as NumPy arrays and PyTorch tensors).
This enables zero-copy interoperability with other array libraries.
```python
import numpy as np
from max import tensor
# Create a NumPy array
np_array = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
# Convert to MAX tensor via DLPack
x = tensor.Tensor.from_dlpack(np_array)
```
**Parameters:**
array ([DLPackArray](driver.md#max.driver.DLPackArray)) – Any object supporting the DLPack protocol, such as NumPy
arrays, PyTorch tensors, or JAX arrays.
**Returns:**
A new tensor containing the data from the DLPack array.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `from_graph_value()` {#max.tensor.Tensor.from_graph_value}
> classmethod from\_graph\_value(value)
Creates a tensor from a graph value.
Constructs a tensor from an existing graph value, which can be either
a [`TensorValue`](graph/TensorValue.md#max.graph.TensorValue) or [`BufferValue`](graph/BufferValue.md#max.graph.BufferValue). This
is used for converting graph-level values into tensor objects.
The new tensor is registered as unrealized, backed by the current
realization context.
**Parameters:**
value ([Value](graph/Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – The graph value to wrap. Can be either a TensorValue or
BufferValue from the MAX graph API.
**Returns:**
A new tensor backed by the provided graph value.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `full()` {#max.tensor.Tensor.full}
> classmethod full(shape, value, \*, dtype=None, device=None)
Creates a tensor filled with a specified value.
Returns a new tensor with the given shape where all elements are
initialized to the specified value. This is useful for creating
tensors with uniform values other than zero or one.
```python
from max import tensor
from max.dtype import DType
# Create a 3x3 tensor filled with 7
x = tensor.Tensor.full((3, 3), value=7, dtype=DType.int32)
# Create a 2x4 tensor filled with pi
y = tensor.Tensor.full((2, 4), value=3.14159)
```
**Parameters:**
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](graph/dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The shape of the output tensor. Can be a tuple of integers,
a list of integers, or any value that can be converted to a shape.
* value ([float](https://docs.python.org/3/library/functions.html#float) | [number](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.number)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – The scalar value to fill the tensor with.
* dtype ([DType](dtype.md#max.dtype.DType) | None) – The data type for the tensor elements. If not specified,
defaults to `DType.float32` for CPU devices and
`DType.bfloat16` for accelerator devices.
* device ([Device](driver.md#max.driver.Device) | None) – The device where the tensor will be allocated. If not
specified, defaults to an accelerator if available, otherwise CPU.
**Returns:**
A new tensor with the specified shape filled with the given value.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `full_like()` {#max.tensor.Tensor.full_like}
> classmethod full\_like(input, value)
Creates a tensor filled with a value, matching a given tensor’s properties.
Returns a new tensor filled with the specified value that matches the
shape, data type, and device of the input tensor. This behaves like
NumPy’s `full_like` and PyTorch’s `full_like`.
```python
from max import tensor
from max.dtype import DType
# Create a reference tensor
ref = tensor.Tensor.ones([2, 3], dtype=DType.float32)
# Create tensor filled with 5.0 matching the reference tensor
x = tensor.Tensor.full_like(ref, value=5.0)
```
**Parameters:**
* input ([Tensor](#max.tensor.Tensor) | [TensorType](graph/type.md#max.graph.type.TensorType)) – The tensor or tensor type to match. The returned tensor will
have the same shape, dtype, and device as this input.
* value ([float](https://docs.python.org/3/library/functions.html#float) | [number](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.number)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]) – The scalar value to fill the tensor with.
**Returns:**
A new tensor filled with the specified value, matching the
properties of the input.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `item()` {#max.tensor.Tensor.item}
> item()
Gets the scalar value from a single-element tensor.
Extracts and returns the scalar value from a tensor containing exactly
one element. The tensor is realized if needed and transferred to CPU
before extracting the value.
**Returns:**
The scalar value from the tensor. The return type matches the tensor’s
dtype (e.g., float for float32, int for int32).
**Raises:**
[TypeError](https://docs.python.org/3/library/exceptions.html#TypeError) – If the tensor contains more than one element.
### `max()` {#max.tensor.Tensor.max}
> max(axis=-1)
Computes the maximum values along an axis.
Returns a tensor containing the maximum values along the specified axis.
This is useful for reduction operations and finding peak values in data.
```python
from max import tensor
from max.dtype import DType
# Create a 2x4 tensor
x = tensor.Tensor.constant(
[[1.2, 3.5, 2.1, 0.8], [2.3, 1.9, 4.2, 3.1]], dtype=DType.float32
)
# Find max along last axis (within each row)
row_max = x.max(axis=-1)
# Result: [3.5, 4.2]
# Find max along first axis (within each column)
col_max = x.max(axis=0)
# Result: [2.3, 3.5, 4.2, 3.1]
# Find max over all elements
overall_max = x.max(axis=None)
# Result: 4.2 (maximum value across all elements)
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to compute the maximum. Defaults to -1
(the last axis). If None, computes the maximum across all elements.
**Returns:**
A tensor containing the maximum values along the specified axis.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `mean()` {#max.tensor.Tensor.mean}
> mean(axis=-1)
Computes the mean values along an axis.
Returns a tensor containing the arithmetic mean of values along the
specified axis. This is useful for computing averages, normalizing data,
or aggregating statistics.
```python
from max import tensor
from max.dtype import DType
# Create a 2x4 tensor
x = tensor.Tensor.constant(
[[2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 5.0, 7.0]], dtype=DType.float32
)
# Compute mean along last axis (within each row)
row_mean = x.mean(axis=-1)
# Result: [5.0, 4.0] (mean of each row)
# Compute mean along first axis (within each column)
col_mean = x.mean(axis=0)
# Result: [1.5, 3.5, 5.5, 7.5] (mean of each column)
# Compute mean over all elements
overall_mean = x.mean(axis=None)
# Result: 4.5 (mean of all elements)
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to compute the mean. Defaults to -1
(the last axis). If None, computes the mean across all elements.
**Returns:**
A tensor containing the mean values along the specified axis.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `min()` {#max.tensor.Tensor.min}
> min(axis=-1)
Computes the minimum values along an axis.
Returns a tensor containing the minimum values along the specified axis.
This is useful for reduction operations and finding the smallest values
in data.
```python
from max import tensor
from max.dtype import DType
# Create a 2x4 tensor
x = tensor.Tensor.constant(
[[1.2, 3.5, 2.1, 0.8], [2.3, 1.9, 4.2, 3.1]], dtype=DType.float32
)
# Find min along last axis (within each row)
row_min = x.min(axis=-1)
# Result: [0.8, 1.9]
# Find min along first axis (within each column)
col_min = x.min(axis=0)
# Result: [1.2, 1.9, 2.1, 0.8]
# Find min over all elements
overall_min = x.min(axis=None)
# Result: 0.8 (minimum value across all elements)
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to compute the minimum. Defaults to -1
(the last axis). If None, computes the minimum across all elements.
**Returns:**
A tensor containing the minimum values along the specified axis.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `num_elements()` {#max.tensor.Tensor.num_elements}
> num\_elements()
Gets the total number of elements in the tensor.
Computes the product of all dimensions in the tensor’s shape to
determine the total number of elements.
**Returns:**
The total number of elements in the tensor.
**Return type:**
[int](https://docs.python.org/3/library/functions.html#int)
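For example, using the `zeros()` constructor documented on this page:

```python
from max import tensor
from max.dtype import DType

# A 2x3x4 tensor has 2 * 3 * 4 = 24 elements
x = tensor.Tensor.zeros((2, 3, 4), dtype=DType.float32)
print(x.num_elements())  # 24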
### `ones()` {#max.tensor.Tensor.ones}
> classmethod ones(shape, \*, dtype=None, device=None)
Creates a tensor filled with ones.
Returns a new tensor with the specified shape where all elements are
initialized to one. The tensor is created with eager execution and
automatic compilation.
```python
from max import tensor
from max.driver import CPU
from max.dtype import DType
# Create a 2x3 tensor of ones
x = tensor.Tensor.ones((2, 3), dtype=DType.float32, device=CPU())
# Result: [[1.0, 1.0, 1.0],
# [1.0, 1.0, 1.0]]
# Create a 1D tensor using default dtype and device
y = tensor.Tensor.ones((5,))
```
**Parameters:**
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](graph/dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The shape of the output tensor. Can be a tuple of integers,
a list of integers, or any value that can be converted to a shape.
* dtype ([DType](dtype.md#max.dtype.DType) | None) – The data type for the tensor elements. If not specified,
defaults to `DType.float32` for CPU devices and
`DType.bfloat16` for accelerator devices.
* device ([Device](driver.md#max.driver.Device) | None) – The device where the tensor will be allocated. If not
specified, defaults to an accelerator if available, otherwise CPU.
**Returns:**
A new tensor with the specified shape filled with ones.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `ones_like()` {#max.tensor.Tensor.ones_like}
> classmethod ones\_like(input)
Creates a tensor of ones matching a given tensor’s properties.
Returns a new tensor filled with ones that matches the shape, data type,
and device of the input tensor. This behaves like NumPy’s `ones_like`
and PyTorch’s `ones_like`.
```python
from max import tensor
from max.dtype import DType
# Create a reference tensor
ref = tensor.Tensor.zeros([3, 4], dtype=DType.float32)
# Create ones tensor matching the reference tensor
x = tensor.Tensor.ones_like(ref)
# Result: 3x4 tensor of ones with dtype float32
```
**Parameters:**
input ([Tensor](#max.tensor.Tensor) | [TensorType](graph/type.md#max.graph.type.TensorType)) – The tensor or tensor type to match. The returned tensor will
have the same shape, dtype, and device as this input.
**Returns:**
A new tensor filled with ones matching the properties of the
input.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `permute()` {#max.tensor.Tensor.permute}
> permute(dims)
Permutes the dimensions of the tensor.
Returns a tensor with its dimensions reordered according to the
specified permutation. This is useful for changing the layout of
multi-dimensional data, such as converting between different tensor
layout conventions (e.g., from `[batch, channels, height, width]`
to `[batch, height, width, channels]`).
```python
from max.tensor import Tensor
from max.dtype import DType
# Create a 3D tensor (batch_size=2, channels=3, length=4)
x = Tensor.constant([[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
[[13, 14, 15, 16], [17, 18, 19, 20], [21, 22, 23, 24]]],
dtype=DType.int32)
print(f"Original shape: {x.shape}")
# Output: Original shape: [Dim(2), Dim(3), Dim(4)]
# Rearrange to (batch, length, channels)
y = x.permute([0, 2, 1])
print(f"Permuted shape: {y.shape}")
# Output: Permuted shape: [Dim(2), Dim(4), Dim(3)]
```
**Parameters:**
dims ([list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]) – A list specifying the new order of dimensions. For example,
`[2, 0, 1]` moves dimension 2 to position 0, dimension 0 to
position 1, and dimension 1 to position 2.
**Returns:**
A tensor with permuted dimensions.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `range_like()` {#max.tensor.Tensor.range_like}
> classmethod range\_like(type)
Creates a range tensor matching a given type’s properties.
Returns a new tensor containing sequential indices along the last
dimension, broadcasted to match the shape of the specified tensor type.
Each row (along the last dimension) contains values from 0 to the
dimension size minus one. This is useful for creating position indices
or coordinate tensors.
```python
from max import tensor
from max.graph import TensorType
from max.driver import CPU
from max.dtype import DType
# Create a reference tensor type with shape (2, 4)
ref_type = TensorType(DType.int32, (2, 4), device=CPU())
# Create range tensor matching the reference type
x = tensor.Tensor.range_like(ref_type)
# Result: [[0, 1, 2, 3],
# [0, 1, 2, 3]]
```
**Parameters:**
type ([TensorType](graph/type.md#max.graph.type.TensorType)) – The tensor type to match. The returned tensor will have the
same shape, dtype, and device as this type, with values
representing indices along the last dimension.
**Returns:**
A new tensor with sequential indices broadcasted to match
the input type’s shape.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `rank` {#max.tensor.Tensor.rank}
> property rank: [int](https://docs.python.org/3/library/functions.html#int)
Gets the number of dimensions in the tensor.
Returns the rank (number of dimensions) of the tensor. For example,
a scalar has rank 0, a vector has rank 1, and a matrix has rank 2.
### `real` {#max.tensor.Tensor.real}
> property real: [bool](https://docs.python.org/3/library/functions.html#bool)
### `realize` {#max.tensor.Tensor.realize}
> property realize: [Tensor](#max.tensor.Tensor)
Forces the tensor to realize if it is not already realized.
### `reshape()` {#max.tensor.Tensor.reshape}
> reshape(shape)
Reshapes the tensor to a new shape.
Returns a tensor with the same data but a different shape. The total
number of elements must remain the same. This is useful for changing
tensor dimensions for different operations, such as flattening a
multi-dimensional tensor or converting a 1D tensor into a matrix.
```python
from max import tensor
from max.dtype import DType
# Create a 2x3 tensor
x = tensor.Tensor.constant([[1, 2, 3], [4, 5, 6]], dtype=DType.int32)
print(x.shape) # (2, 3)
# Flatten to 1D
y = x.reshape((6,))
print(y.shape) # (6,)
# Values: [1, 2, 3, 4, 5, 6]
```
**Parameters:**
shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](graph/dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The desired output shape. Can be a tuple or list of integers.
The total number of elements must equal the original tensor’s
element count.
**Returns:**
A reshaped tensor with the specified shape.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `shape` {#max.tensor.Tensor.shape}
> property shape: [Shape](graph/shape.md#max.graph.shape.Shape)
Gets the shape of the tensor.
Returns the dimensions of the tensor as a shape object.
**Returns:**
The shape of the tensor.
**Return type:**
[Shape](graph/shape.md#max.graph.shape.Shape)
### `split()` {#max.tensor.Tensor.split}
> split(split\_size\_or\_sections, axis=0)
Splits the tensor into multiple tensors along a given dimension.
This method supports two modes, matching PyTorch’s behavior:
* If `split_size_or_sections` is an **int**, splits into chunks of
that size (the last chunk may be smaller if not evenly divisible).
* If `split_size_or_sections` is a **list of ints**, splits into
chunks with exactly those sizes (must sum to the dimension size).
```python
from max import tensor
from max.dtype import DType
# Create a 10x4 tensor
x = tensor.Tensor.ones([10, 4], dtype=DType.float32)
# Split into chunks of size 3 (last chunk is size 1)
chunks = x.split(3, axis=0)
# Result: 4 tensors with shapes [3,4], [3,4], [3,4], [1,4]
# Split into exact sizes
chunks = x.split([2, 3, 5], axis=0)
# Result: 3 tensors with shapes [2,4], [3,4], [5,4]
```
**Parameters:**
* split\_size\_or\_sections ([int](https://docs.python.org/3/library/functions.html#int) | [list](https://docs.python.org/3/library/stdtypes.html#list)\[[int](https://docs.python.org/3/library/functions.html#int)]) – Either an int (chunk size) or a list of
ints (exact sizes for each output tensor).
* axis ([int](https://docs.python.org/3/library/functions.html#int)) – The dimension along which to split. Defaults to 0.
**Returns:**
The tensors produced by the split, in order along the given axis.
### `squeeze()` {#max.tensor.Tensor.squeeze}
> squeeze(axis)
Removes a size-1 dimension from the tensor.
Returns a tensor with the specified size-1 dimension removed. This is
useful for removing singleton dimensions from tensors after operations
that may have added them.
```python
from max import tensor
from max.dtype import DType
# Create a tensor with a size-1 dimension
x = tensor.Tensor.ones([4, 1, 6], dtype=DType.float32)
print(x.shape) # (4, 1, 6)
# Squeeze out the size-1 dimension
y = x.squeeze(axis=1)
print(y.shape) # (4, 6)
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int)) – The dimension to remove from the tensor’s shape. If negative,
this indexes from the end of the tensor. The dimension at this
axis must have size 1.
**Returns:**
A tensor with the specified dimension removed.
**Return type:**
[Tensor](#max.tensor.Tensor)
**Raises:**
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) – If the dimension at the specified axis is not size 1.
### `state` {#max.tensor.Tensor.state}
> state: [RealizationState](#max.tensor.RealizationState) | [None](https://docs.python.org/3/library/constants.html#None)
State for realizing an unrealized tensor.
### `storage` {#max.tensor.Tensor.storage}
> storage: [Buffer](driver.md#max.driver.Buffer) | [None](https://docs.python.org/3/library/constants.html#None)
Underlying memory for a realized tensor.
If the tensor is used in any mutating operations that have
not been realized, this holds the state before any updates.
### `sum()` {#max.tensor.Tensor.sum}
> sum(axis=-1)
Computes the sum of values along an axis.
Returns a tensor containing the sum of values along the specified axis.
This is a fundamental reduction operation used for aggregating data,
computing totals, and implementing other operations like mean.
```python
from max import tensor
from max.dtype import DType
# Create a 2x3 tensor
x = tensor.Tensor.constant(
[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=DType.float32
)
# Sum along last axis (within each row)
row_sum = x.sum(axis=-1)
# Result: [6.0, 15.0] (sum of each row)
# Sum along first axis (within each column)
col_sum = x.sum(axis=0)
# Result: [5.0, 7.0, 9.0] (sum of each column)
# Sum over all elements
total = x.sum(axis=None)
# Result: 21.0 (sum of all elements)
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int) | None) – The axis along which to compute the sum. Defaults to -1
(the last axis). If None, computes the sum across all elements.
**Returns:**
A tensor containing the sum along the specified axis.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `to()` {#max.tensor.Tensor.to}
> to(device)
Transfers the tensor to a different device.
Creates a new tensor with the same data on the specified device. This
allows moving tensors between CPU and accelerators or between different
accelerator devices.
```python
from max import tensor
from max.driver import CPU, Accelerator
# Create a tensor on CPU
x = tensor.Tensor.ones((2, 3), device=CPU())
print(x.device) # CPU
# Transfer to accelerator
y = x.to(Accelerator())
print(y.device) # Accelerator(0)
```
**Parameters:**
device ([Device](driver.md#max.driver.Device)) – The target device for the tensor.
**Returns:**
A new tensor with the same data on the specified device.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `transpose()` {#max.tensor.Tensor.transpose}
> transpose(dim1, dim2)
Returns a tensor that is a transposed version of input.
The given dimensions `dim1` and `dim2` are swapped.
```python
from max.tensor import Tensor
from max.dtype import DType
# Create a 2x3 matrix
x = Tensor.constant([[1, 2, 3], [4, 5, 6]], dtype=DType.int32)
print(f"Original shape: {x.shape}")
# Output: Original shape: [Dim(2), Dim(3)]
print(x)
# Transpose dimensions 0 and 1 to get a 3x2 matrix
y = x.transpose(0, 1)
print(f"Transposed shape: {y.shape}")
# Output: Transposed shape: [Dim(3), Dim(2)]
print(y)
```
**Parameters:**
* dim1 ([int](https://docs.python.org/3/library/functions.html#int)) – The first dimension to be transposed.
* dim2 ([int](https://docs.python.org/3/library/functions.html#int)) – The second dimension to be transposed.
**Returns:**
A tensor with dimensions `dim1` and `dim2` swapped.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `type` {#max.tensor.Tensor.type}
> property type: [TensorType](graph/type.md#max.graph.type.TensorType)
Gets the tensor type information.
Returns the type information for the tensor, including shape, dtype,
and device. If the underlying value is a buffer type, it’s converted
to a tensor type.
### `unsqueeze()` {#max.tensor.Tensor.unsqueeze}
> unsqueeze(axis)
Inserts a size-1 dimension into the tensor.
Returns a tensor with a new size-1 dimension inserted at the specified
position. This is the inverse of [`squeeze()`](#max.tensor.Tensor.squeeze) and is useful for
adding dimensions needed for broadcasting or matrix operations.
```python
from max import tensor
from max.dtype import DType
# Create a 1D tensor
x = tensor.Tensor.constant([1.0, 2.0, 3.0], dtype=DType.float32)
print(x.shape) # (3,)
# Add dimension at the end
y = x.unsqueeze(axis=-1)
print(y.shape) # (3, 1)
# Add dimension at the beginning
z = x.unsqueeze(axis=0)
print(z.shape) # (1, 3)
```
**Parameters:**
axis ([int](https://docs.python.org/3/library/functions.html#int)) – The index at which to insert the new dimension. If negative,
indexes relative to 1 plus the rank of the tensor. For example,
`axis=-1` adds a dimension at the end.
**Returns:**
A tensor with an additional size-1 dimension.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `zeros()` {#max.tensor.Tensor.zeros}
> classmethod zeros(shape, \*, dtype=None, device=None)
Creates a tensor filled with zeros.
Returns a new tensor with the specified shape where all elements are
initialized to zero. The tensor is created with eager execution and
automatic compilation.
```python
from max import tensor
from max.driver import CPU
from max.dtype import DType
# Create a 2x3 tensor of zeros
x = tensor.Tensor.zeros((2, 3), dtype=DType.float32, device=CPU())
# Result: [[0.0, 0.0, 0.0],
# [0.0, 0.0, 0.0]]
# Create a 1D tensor using default dtype and device
y = tensor.Tensor.zeros((5,))
```
**Parameters:**
* shape ([Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[int](https://docs.python.org/3/library/functions.html#int) | [str](https://docs.python.org/3/library/stdtypes.html#str) | [Dim](graph/dim.md#max.graph.dim.Dim) | [integer](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.integer)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]]) – The shape of the output tensor. Can be a tuple of integers,
a list of integers, or any value that can be converted to a shape.
* dtype ([DType](dtype.md#max.dtype.DType) | None) – The data type for the tensor elements. If not specified,
defaults to `DType.float32` for CPU devices and
`DType.bfloat16` for accelerator devices.
* device ([Device](driver.md#max.driver.Device) | None) – The device where the tensor will be allocated. If not
specified, defaults to an accelerator if available, otherwise CPU.
**Returns:**
A new tensor with the specified shape filled with zeros.
**Return type:**
[Tensor](#max.tensor.Tensor)
### `zeros_like()` {#max.tensor.Tensor.zeros_like}
> classmethod zeros\_like(input)
Creates a tensor of zeros matching a given tensor’s properties.
Returns a new tensor filled with zeros that matches the shape, data type,
and device of the input tensor. This behaves like NumPy’s `zeros_like`
and PyTorch’s `zeros_like`.
```python
from max import tensor
from max.dtype import DType
# Create a reference tensor
ref = tensor.Tensor.ones([3, 4], dtype=DType.float32)
# Create zeros tensor matching the reference tensor
x = tensor.Tensor.zeros_like(ref)
# Result: 3x4 tensor of zeros with dtype float32
```
**Parameters:**
input ([Tensor](#max.tensor.Tensor) | [TensorType](graph/type.md#max.graph.type.TensorType)) – The tensor or tensor type to match. The returned tensor will
have the same shape, dtype, and device as this input.
**Returns:**
A new tensor filled with zeros matching the properties of the
input.
**Return type:**
[Tensor](#max.tensor.Tensor)
## `current_realization_context()` {#max.tensor.current_realization_context}
> max.tensor.current\_realization\_context()
Return the value of the realization context variable for the current context.
If there is no value for the variable in the current context, the method will:
* return the value of the default argument of the method, if provided; or
* return the default value for the context variable, if it was created with one; or
* raise a `LookupError`.
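This lookup order is the standard behavior of Python's `contextvars.ContextVar.get()`, which the sketch below demonstrates with the standard library alone (independent of MAX):

```python
from contextvars import ContextVar

# A context variable created with a default value.
var = ContextVar("demo", default="fallback")

# No value set in the current context: the variable's default is used.
assert var.get() == "fallback"
# An explicit default argument takes precedence over the variable's default.
assert var.get("explicit") == "explicit"

token = var.set("active")
assert var.get() == "active"

# Resetting restores the previous state.
var.reset(token)
assert var.get() == "fallback"
```

A `ContextVar` created without a default raises `LookupError` on `get()` when no value is set and no default argument is passed.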
## `default_device()` {#max.tensor.default_device}
> max.tensor.default\_device(device)
Context manager for setting the default device for tensor creation.
Sets the default device used for tensor creation within the context. All
tensors created inside the context block without an explicit device
parameter will use this device.
```python
from max import tensor
from max.driver import CPU
# Use CPU as default device in this context
with tensor.default_device(CPU()):
x = tensor.Tensor.ones((2, 3)) # Created on CPU
y = tensor.Tensor.zeros((2, 3)) # Also on CPU
```
**Parameters:**
device ([Device](driver.md#max.driver.Device) | [DeviceRef](graph/type.md#max.graph.type.DeviceRef)) – The device to use as the default for tensor creation within
the context.
**Returns:**
A context manager that sets the default device.
## `default_dtype()` {#max.tensor.default_dtype}
> max.tensor.default\_dtype(dtype)
Context manager for setting the default dtype for tensor creation.
Sets the default data type used for tensor creation within the context. All
tensors created inside the context block without an explicit dtype parameter
will use this data type.
```python
from max import tensor
from max.dtype import DType
# Use int32 as default dtype in this context
with tensor.default_dtype(DType.int32):
x = tensor.Tensor.ones((2, 3)) # Created with int32
y = tensor.Tensor.zeros((2, 3)) # Also int32
```
**Parameters:**
dtype ([DType](dtype.md#max.dtype.DType)) – The data type to use as the default for tensor creation within
the context.
**Returns:**
A context manager that sets the default dtype.
## `defaults()` {#max.tensor.defaults}
> max.tensor.defaults(dtype=None, device=None)
Gets the default dtype and device for tensor creation.
Returns a tuple containing the dtype and device to use for tensor creation,
applying defaults when values are not specified. If no dtype is provided,
defaults to `DType.float32` for CPU and `DType.bfloat16` for
accelerators. If no device is provided, defaults to an accelerator if
available, otherwise CPU.
**Parameters:**
* dtype ([DType](dtype.md#max.dtype.DType) | None) – The data type to use. If not specified, a default dtype based
on the device is returned.
* device ([Device](driver.md#max.driver.Device) | None) – The device to use. If not specified, defaults to an available
accelerator or CPU.
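The documented defaulting rules can be mirrored in plain Python (a sketch only; `pick_defaults` is a hypothetical stand-in, not the MAX implementation):

```python
def pick_defaults(dtype=None, device=None, accelerator_available=False):
    """Mirror the documented rules of max.tensor.defaults()."""
    if device is None:
        # Prefer an accelerator when one is available, otherwise CPU.
        device = "accelerator" if accelerator_available else "cpu"
    if dtype is None:
        # float32 on CPU, bfloat16 on accelerators.
        dtype = "bfloat16" if device == "accelerator" else "float32"
    return dtype, device

assert pick_defaults() == ("float32", "cpu")
assert pick_defaults(accelerator_available=True) == ("bfloat16", "accelerator")
```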
## `defaults_like()` {#max.tensor.defaults_like}
> max.tensor.defaults\_like(like)
Context manager setting the default dtype and device for tensor creation.
Sets the default data type and device used for tensor creation within the
context. All tensors created inside the context block without explicit
dtypes or devices will use these parameters.
```python
from max import tensor
from max.driver import CPU
from max.dtype import DType
x = tensor.Tensor.zeros([1], dtype=DType.int32, device=CPU())
# Use x's dtype and device as defaults in this context
with tensor.defaults_like(x):
y = tensor.Tensor.zeros((2, 3)) # int32, cpu
z = tensor.Tensor.zeros((2, 3), dtype=DType.float32) # float32, cpu
```
**Parameters:**
like ([Tensor](#max.tensor.Tensor) | [TensorType](graph/type.md#max.graph.type.TensorType)) – A tensor to use as the default dtype and device for the context.
**Returns:**
A context manager that sets the default dtype and device.
## `realization_context()` {#max.tensor.realization_context}
> max.tensor.realization\_context(ctx)
Sets the current realization context, within a context manager.
New tensors created within this block will use the given realization
context to execute.
See [`RealizationContext`](#max.tensor.RealizationContext).
**Parameters:**
ctx ([RealizationContext](#max.tensor.RealizationContext)) – The realization context to set as the current context.
**Returns:**
A context manager. When the context manager is entered, it will
set ctx as the current realization context. When exited the
current realization context will be reset to its previous value.
---
## torch
## `CustomOpLibrary` {#max.torch.CustomOpLibrary}
> class max.torch.CustomOpLibrary(kernel\_library)
A PyTorch interface to custom operations implemented in Mojo.
This API allows for easy passing of PyTorch data as
`torch.Tensor` values to the corresponding custom op. `CustomOpLibrary`
handles the compilation of the Mojo custom ops and marshalling of data between
PyTorch and the executable Mojo code.
For example, consider a grayscale operation implemented in Mojo:
```mojo title="my_library/grayscale.mojo"
@register("grayscale")
struct Grayscale:
@staticmethod
fn execute[
# The kind of device this is running on: "cpu" or "gpu"
target: StaticString,
](
img_out: OutputTensor[dtype = DType.uint8, rank=2],
img_in: InputTensor[dtype = DType.uint8, rank=3],
ctx: DeviceContextPtr,
) raises:
...
```
You can then use `CustomOpLibrary` to invoke the Mojo operation like so:
```python
import torch
from max.torch import CustomOpLibrary
op_library = CustomOpLibrary("my_library")
grayscale_op = op_library.grayscale
def grayscale(pic: torch.Tensor) -> torch.Tensor:
result = pic.new_empty(pic.shape[:-1])
grayscale_op(result, pic)
return result
img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
result = grayscale(img)
```
The custom operation produced by `op_library.grayscale` will have the
same interface as the backing Mojo operation. Each `InputTensor` or
`OutputTensor` argument corresponds to a
[`torch.Tensor`](https://docs.pytorch.org/docs/stable/tensors.html#tensor-class-reference)
value in Python. Each argument corresponding to an `OutputTensor` in the
Mojo operation will be modified in-place.
For more information, see the [custom ops for PyTorch](/max/tutorials/custom-kernels-pytorch) tutorial.
**Parameters:**
kernel\_library (Path | [KernelLibrary](graph/KernelLibrary.md#max.graph.KernelLibrary)) – The path to a `.mojo` file or a `.mojopkg` with
your custom op kernels, or the corresponding library object.
## `graph_op()` {#max.torch.graph_op}
> max.torch.graph\_op(fn=None, name=None, kernel\_library=None, input\_types=None, output\_types=None, num\_outputs=None)
A decorator to create PyTorch custom operations using MAX graph operations.
This decorator allows you to define larger graphs using [MAX graph
ops](/max/api/python/graph/ops) or the MAX `nn` modules and
call them with PyTorch tensors, or integrate them into PyTorch modules.
These custom ops can be called eagerly, and support compilation with
`torch.compile` and the Inductor backend.
The resulting custom operation uses destination-passing style, where output
tensors are passed as the first arguments and modified in-place. This
allows PyTorch to manage the memory and streams of the output tensors.
Tensors internal to the computation are managed via MAX’s graph compiler
and memory planning.
The default behavior is to JIT-compile for the specific input and output
shapes needed. If you are passing variable-sized inputs, for instance a
batch size or sequence length which may take on many different values
between calls, you should specify this dimension as a symbolic dimension
through `input_types` and `output_types`. Otherwise you will
end up compiling specialized graphs for each possible variation of
inputs, which may use a lot of memory.
If neither `output_types` nor `num_outputs` is specified, the operation
defaults to a single output.
For example to create a functional-style PyTorch op backed by MAX:
```python title="grayscale.py"
import torch
import numpy as np
import max.torch
from max.dtype import DType
from max.graph import ops
@max.torch.graph_op
def max_grayscale(pic: max.graph.TensorValue):
scaled = pic.cast(DType.float32) * np.array([0.21, 0.71, 0.07])
grayscaled = ops.sum(scaled, axis=-1).cast(pic.dtype)
# max reductions don't remove the dimension, need to squeeze
return ops.squeeze(grayscaled, axis=-1)
@torch.compile
def grayscale(pic: torch.Tensor):
output = pic.new_empty(pic.shape[:-1]) # Remove color channel dimension
max_grayscale(output, pic) # Call as destination-passing style
return output
device = "cuda" if torch.cuda.is_available() else "cpu"
img = (torch.rand(64, 64, 3, device=device) * 255).to(torch.uint8)
result = grayscale(img)
print(f"Input shape: {img.shape}")
print(f"Output shape: {result.shape}")
print("Grayscale conversion completed successfully!")
```
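The weighted-sum math in the graph above can be checked with plain NumPy, using the same `[0.21, 0.71, 0.07]` channel weights as the example (a numerical sketch only, not a MAX call):

```python
import numpy as np

# A small random RGB image, matching the example's uint8 input.
img = (np.random.rand(8, 8, 3) * 255).astype(np.uint8)

weights = np.array([0.21, 0.71, 0.07])
# Cast up, apply per-channel weights, reduce the color axis, cast back.
gray = (img.astype(np.float32) * weights).sum(axis=-1).astype(np.uint8)

assert gray.shape == (8, 8)   # color channel removed
assert gray.dtype == np.uint8
```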
**Parameters:**
* fn ([Callable](graph/ops.md#max.graph.ops.Callable)\[\[...], [Iterable](https://docs.python.org/3/library/collections.abc.html#collections.abc.Iterable)\[[Value](graph/Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)]] | [Value](graph/Value.md#max.graph.Value)\[[Any](https://docs.python.org/3/library/typing.html#typing.Any)] | None] | None) – The function to decorate. If None, returns a decorator.
* name ([str](https://docs.python.org/3/library/stdtypes.html#str) | None) – Optional name for the custom operation. Defaults to the function name.
* kernel\_library (Path | [KernelLibrary](graph/KernelLibrary.md#max.graph.KernelLibrary) | None) – Optional kernel library to use for compilation. Useful
for creating graphs with custom Mojo ops.
* input\_types ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[TensorType](graph/type.md#max.graph.type.TensorType)] | None) – Optional sequence of input tensor types for compilation.
If None, types are inferred from runtime arguments.
* output\_types ([Sequence](https://docs.python.org/3/library/collections.abc.html#collections.abc.Sequence)\[[TensorType](graph/type.md#max.graph.type.TensorType)] | None) – Optional sequence of output tensor types for compilation.
If None, types are inferred from runtime arguments.
* num\_outputs ([int](https://docs.python.org/3/library/functions.html#int) | None) – The number of outputs of the graph. We need to know this ahead
of time to register with PyTorch before we’ve compiled the final kernels.
**Returns:**
A PyTorch custom operation that can be called with torch.Tensor arguments.
---
## What's new
Here's everything you should know about what's changed in each release.
## Nightly: v26.2
This version is still a work in progress.
See how to [install the nightly
release](/max/packages#install).
### MAX models {#26-2-models}
* Add support for Qwen/Qwen3-30B-A3B-Instruct-2507, a mixture-of-experts (MoE) model.
* Add multi-GPU tensor parallelism support for Qwen3 and Qwen3-MoE models.
* Remove legacy Gemma 3 multimodal implementation and the
`MODULAR_MAX_DISABLE_GEMMA3_VISION` environment variable.
* Implement multi-GPU support (tensor parallelism) for GPT-OSS.
### MAX framework {#26-2-max}
#### Inference server {#26-2-max-serve}
* Enabled overlap scheduling by default for select model architectures such as
`LlamaForCausalLM_Legacy`. This optimization reduces CPU overhead by
overlapping Python host code with GPU kernel execution. It is currently
incompatible with some features, such as structured outputs and CPU models,
and is still highly experimental. You can forcibly disable it with
`--no-enable-overlap-scheduler --force`.
#### Python API {#26-2-max-python}
* Keep a global MLIR context active and drop per-graph context plumbing so
algebraic dims and graph/custom op construction work without an explicit
context manager. Threadpool-backed MAX paths now scope worker-thread MLIR
usage to the default context automatically.
### MAX kernels {#26-2-max-kernels}
### Mojo language {#26-2-mojo}
For all the updates to the Mojo language, standard library, and tools,
including all GPU programming and `Layout`/`LayoutTensor` changes, see the [Mojo
changelog](/mojo/changelog).
## v26.1 (2026-01-29)
### Highlights {#26-1-highlights}
The eager-style [`Tensor`](/max/api/python/tensor#max.tensor.Tensor) and
[`Module`](/max/api/python/nn/module#max.nn.module.Module) APIs are
now the primary API for model development, providing a PyTorch-like development
experience:
```python
from max import functional as F
from max.tensor import Tensor
from max.dtype import DType
x = Tensor.constant([1.0, -2.0, 3.0, -4.0, 5.0], dtype=DType.float16)
y = F.relu(x)
print(y)
# Tensor([1 0 3 0 5], dtype=DType.float16, device=Device(type=gpu,id=0))
```
If you want explicit control over the graph structure, you can
still build models with the [`Graph`](/max/api/python/graph/Graph) APIs.
For more details, see the [model developer guide](/max/develop/).
### Documentation {#26-1-docs}
* The fully refactored [MAX LLM book](https://llm.modular.com/) is now designed
so the code you write in each exercise incrementally builds upon the last one,
until you've built an executable GPT-2 model with the MAX Python API.
* New model developer guide introduces [eager-style
programming](/max/develop/), [tensor APIs](/max/develop/tensors), and [data
types](/max/develop/dtypes). Much more is coming soon.
* New guide to [profile MAX on GPUs with `nsys`](/max/gpu-system-profiling).
* Extended [documentation for
`kbench`](https://github.com/modular/modular/tree/main/max/kernels/benchmarks/autotune#kbench-a-benchmarking-toolkit-for-mojo-kernels),
a Python tool to benchmark, autotune, and analyze MAX kernel performance.
### MAX models {#26-1-models}
* [Gemma3](https://builds.modular.com/models/gemma-3-it/27B) now supports
vision input (multimodal) in the 12B and 27B variants, including support for
local file paths and structured output. Learn more in the [image to text
guide](/max/inference/image-to-text).
* Added `Qwen/Qwen3-VL-4B-Instruct` and `Qwen/Qwen3-VL-2B-Instruct`
model architectures.
* Removed Llama 3.2 Vision (`Llama-3.2-11B-Vision-Instruct`) architecture support.
Use other vision models such as Pixtral, InternVL, Qwen2.5-VL, and Gemma3.
### MAX framework {#26-1-max}
* All Python wheels are now hosted at `https://whl.modular.com/nightly/simple/`.
If using `uv`, change `--index-url` to `--index`, and if using `pip`, change to
`--extra-index-url`. For precise commands, see the
[install guide](/max/packages#install).
#### Inference server {#26-1-max-serve}
* Improved scheduling to achieve higher KVCache utilization and batch sizes. By
default, MAX now schedules a context encoding (CE) request only if KVCache
memory is less than 95% full *after* allocating blocks for that request or if
no active requests exist. You can adjust this watermark value (`0.95`) with
[`--kvcache-ce-watermark`](/max/cli/serve#--kvcache-ce-watermark-kvcache_ce_watermark).
Beware that increasing it causes more preemptions.
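The scheduling rule above can be sketched in plain Python (`should_schedule_ce` is a hypothetical illustration, not the actual scheduler code):

```python
def should_schedule_ce(used_blocks, request_blocks, total_blocks,
                       active_requests, watermark=0.95):
    # Always admit a CE request when nothing else is running.
    if active_requests == 0:
        return True
    # Otherwise admit only if the KVCache stays under the watermark
    # *after* allocating blocks for this request.
    return (used_blocks + request_blocks) / total_blocks < watermark

assert should_schedule_ce(50, 10, 100, active_requests=3)       # 60% < 95%
assert not should_schedule_ce(90, 10, 100, active_requests=3)   # would hit 100%
assert should_schedule_ce(99, 10, 100, active_requests=0)       # idle server
```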
* When running models with data-parallelism (DP), the semantics of max batch size
has changed. For example, when specifying `--data-parallel-degree 8` and
`--max-batch-size 32` it previously meant that each data-parallel replica could
have at most 4 requests, for an aggregate max batch size of 32. The CLI flag
now specifies the max batch size per replica, so the aggregate max batch size
for the values above is 8 × 32 = 256 requests. This aligns with vLLM and
other inference engines.
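The new batch-size arithmetic can be stated as a one-liner (illustrative only):

```python
def aggregate_max_batch_size(data_parallel_degree, max_batch_size_per_replica):
    # --max-batch-size is now per replica, so the aggregate cap is the product.
    return data_parallel_degree * max_batch_size_per_replica

# e.g. --data-parallel-degree 8 --max-batch-size 32
assert aggregate_max_batch_size(8, 32) == 256
```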
* `--max-ce-batch-size` is now deprecated. The cap on batch size is now uniform
between context encoding and token generation phases of text generation. Use
`--max-batch-size` instead.
* The API server now returns chunked tokens from the model worker, reducing overhead
and significantly improving throughput for small models and decode-heavy
workloads.
* Server stats collection (`collect_server_stats`) is now enabled by default for
serving benchmarks.
#### `max` CLI {#26-1-max-cli}
* The `max generate` command now applies the model's chat template internally
when using `--prompt`. This more closely aligns with how users typically prompt
a model for testing and ensures special tokens are properly filtered from
output.
* Added tracing flags to `max benchmark` for `nsys` profiling:
* `--trace`: Enable tracing of the benchmark run (currently NVIDIA GPUs only)
* `--trace-file`: Path to save the trace file
* `--trace-session`: Optional session name for tracing
Requires the server to be run under `nsys launch`. Using
`--gpu-profiling detailed` is recommended.
#### Python API {#26-1-max-python}
* The eager-style [`Tensor`](/max/api/python/tensor#max.tensor.Tensor) APIs are
now the primary API for model development, providing a PyTorch-like development
experience.
We moved the eager-style tensor APIs out of `experimental` and
reorganized the `max.nn` module to make the eager module
system the primary API (`nn.module_v3` is now `nn.module`).
The previous [`max.nn`](/max/api/python/nn/) components are still available
for backward compatibility in [`max.nn.legacy`](/max/api/python/nn/legacy/).
* Renamed `max.driver.Tensor` to
[`max.driver.Buffer`](/max/api/python/driver#max.driver.Buffer) to clarify that
it represents a low-level memory buffer, not a tensor. The
[`max.tensor.Tensor`](/max/api/python/tensor#max.tensor.Tensor) class remains
the primary tensor type.
* Added `forward()` method to
[`Module`](/max/api/python/nn/module#max.nn.module.Module) to compute the
output—it behaves the same as invoking the object as a callable (the
`__call__()` method).
* `accelerator_count()` now returns a non-zero value when called on an Apple
silicon system. This means you can use this code:
```python
device = CPU() if accelerator_count() == 0 else Accelerator()
```
And it defaults to using the available Apple silicon GPU. As a consequence,
MAX graphs should in most cases be dispatched to run on Apple silicon GPUs.
Note that most MAX models do not yet work on Apple silicon GPUs due to
missing hardware-specific kernel pathways and other support, but this is an
important step towards enabling MAX more broadly on Apple silicon GPUs.
* Added `max.nn.module.rope` containing rotary embedding implementations,
[`RotaryEmbedding`](/max/api/python/nn/rope/RotaryEmbedding) and
[`TransposedRotaryEmbedding`](/max/api/python/nn/rope/TransposedRotaryEmbedding).
* Added
[`ArchConfig`](/max/api/python/pipelines/interfaces#max.pipelines.lib.interfaces.ArchConfig)
and `ArchConfigWithKVCache`. Going forward, models that register with the MAX
architecture registry must define a config that implements this protocol.
* Added `ops.complex.mul` for multiplying complex-valued tensors.
* Added `calculate_virtual_device_count()`, `calculate_virtual_device_count_from_cli()`,
`load_max_buffer()` to [`max.driver`](/max/api/python/driver/).
* Added [`TokenBuffer`](/max/api/python/interfaces#max.interfaces.TokenBuffer)
for token management.
* Renamed `prefill_chunk_size` to `max_batch_input_tokens`
and `max_batch_context_length` to `max_batch_total_tokens`
in [`PipelineConfig`](/max/api/python/pipelines/config/#max.pipelines.lib.config.PipelineConfig)
and `TTSConfig` classes to better reflect their purpose in batch memory
management.
The corresponding CLI flags have also been renamed:
`--prefill-chunk-size` is now `--max-batch-input-tokens` and
`--max-batch-context-length` is now `--max-batch-total-tokens`.
* Fixed `max.driver.Buffer.to(stream)` to not copy (it returns a reference to
the same buffer) when the stream is on the same device, even for GPU-pinned
host memory.
* Removed deprecated `max.nn` convolution classes: `Conv2dV1`, `Conv1DV1`,
`Conv3DV1`. Use `Conv2d`, `Conv1D`, `Conv3D` instead.
* Removed deprecated `max.nn` layer classes: `LinearV1`, `QLinearV1`,
`GPTQLinearV1`, `MLPV1`, `EmbeddingV1`, `LayerNormV1`, `RMSNormV1`. Use
`Linear`, `GPTQLinear`, `MLP`, `Embedding`, `LayerNorm`, `RMSNorm` instead.
* Removed `max.engine.MojoValue`
* Removed the deprecated `custom_ops_path` parameter from
[`InferenceSession.load()`](/max/api/python/engine#max.engine.InferenceSession.load).
Instead use the `custom_extensions` parameter.
* Added `graph.ops.shard_and_stack()`
* Removed unused `graph.weights.PytorchWeights`
### MAX kernels {#26-1-max-kernels}
* Improved Hopper matmul performance for skinny M shapes. In particular, when
M is between 2 and 64, specific shapes see a performance boost of 10–40%.
* Added a swapAB optimization to Hopper matmul, which computes B × A and does
a transposed write to C. This helps when you need more granularity in the M
dimension.
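The swapAB trick relies on the identity A·B = (Bᵀ·Aᵀ)ᵀ, which is easy to verify with NumPy (a numerical check of the math only, not the kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 16))   # skinny M dimension
b = rng.standard_normal((16, 8))

# Multiplying the swapped, transposed operands and doing a transposed
# write yields the same result as the direct matmul.
swapped = (b.T @ a.T).T
assert np.allclose(swapped, a @ b)
```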
* Refined `create_stream` API: all streams are now non-blocking (`blocking`
argument has been removed). Explicitly use `DeviceEvent` and `synchronize()`
wherever necessary.
### Mojo language {#26-1-mojo}
For all the updates to the Mojo language, standard library, and tools,
including all GPU programming and `Layout`/`LayoutTensor` changes, see the [Mojo
changelog](/mojo/changelog)
## v25.7 (2025-11-20)
### Highlights {#25-7-highlights}
* The MAX Python API is now [fully open-sourced on
GitHub](https://github.com/modular/modular/tree/main/max/python/max)!
As we expand our [model
repository](https://builds.modular.com/?category=models), we're making
significant progress on these APIs to simplify the effort to build
production-ready GenAI models in Python. Some APIs are still experimental,
but you can [build an LLM with it today](https://llm.modular.com).
### Documentation {#25-7-docs}
* New online book to [build an LLM from scratch with
MAX](https://llm.modular.com), using our **experimental model APIs**. This is a
guided lesson on building GPT-2 with our Python API, explaining each component
of the transformer model along the way. Like the Python APIs, the book is a
work in progress—please [report any issues in
GitHub](https://github.com/modular/max-llm-book/issues).
* All the planned parts of [GPU Puzzles](https://puzzles.modular.com/) are now
complete! Support for Apple silicon GPUs is also making [steady
progress](https://puzzles.modular.com/howto.html#gpu-support-matrix).
* Tutorials on docs.modular.com are now integrated into the
[Guides](/max/intro) section, indicated with a book icon in the left
navigation.
* The [`max` CLI docs](/max/cli/) are now generated from [the CLI
source](https://github.com/modular/modular/blob/main/max/python/max/entrypoints/pipelines.py).
### MAX models {#25-7-models}
* Gemma3 now supports logprobs.
### MAX framework {#25-7-max}
* Added support for bfloat16 models running on GPUs with ARM-based CPU hosts,
such as Grace Hopper (GH200) and Grace Blackwell (GB200).
* Updated minimum NVIDIA GPU driver requirement to 580.
#### `max` CLI {#25-7-max-cli}
* [`max benchmark`](/max/cli/benchmark) can now run LoRA benchmarking for
supported models and target modules.
* `max benchmark --collect-gpu-stats` can now collect AMD
GPU statistics.
* `max serve --do-penalties` was renamed to `--enable-penalties` and enabled by
default. To disable penalties, you can specify
[`--no-enable-penalties`](/max/cli/serve#--enable-penalties---no-enable-penalties).
#### Python API {#25-7-max-python}
* Added support for Python 3.14.
* Removed support for Python 3.9.
* All MAX Python API modules are now **open-sourced**. In addition to those
previously released, we've added `driver`, `dtype`, `engine`, `experimental`,
`interfaces`, `kv_cache`, `mlir`, `nn`, `profiler`, `support`, `torch`, and
more [in our GitHub
repo](https://github.com/modular/modular/tree/main/max/python/max).
* Added the [`max.profiler`](/max/api/python/profiler) module with the
[`Tracer`](/max/api/python/profiler#max.profiler.Tracer) class to create and
manage profiling spans based on runtime conditions, and the `@traced()`
decorator to profile a whole function.
* Added [`max.diagnostics.gpu`](/max/api/python/diagnostics/gpu) APIs to expose
common GPU statistics as might be reported by `nvidia-smi` or `rocm-smi`.
* Added the [`max.kv_cache`](/max/api/python/kv_cache/) package, which provides
APIs to manage key-value caches used in transformer models. Not to be confused
with the existing [`max.nn.kv_cache`](/max/api/python/nn/kv_cache/) package that
includes kernels for KV caching.
* Removed the `KVCacheManager` class and combined it with the single
[`PagedKVCacheManager`](/max/api/python/kv_cache/paged_cache/cache_manager#max.kv_cache.paged_cache.cache_manager.PagedKVCacheManager)
implementation. During the merge, `prefetch()` was renamed to `maybe_reserve()`.
* Added
[`NullKVCacheManager`](/max/api/python/kv_cache/null_cache_manager#max.kv_cache.NullKVCacheManager)
for compile-only mode, which avoids GPU memory allocation when compiling models
without a physical GPU present.
* Added
[`ResetPrefixCacheBackend`](/max/api/python/kv_cache/paged_cache/tp_cache_manager#max.kv_cache.paged_cache.ResetPrefixCacheBackend)
and
[`ResetPrefixCacheFrontend`](/max/api/python/kv_cache/paged_cache/tp_cache_manager#max.kv_cache.paged_cache.ResetPrefixCacheFrontend)
classes for coordinating prefix cache resets between frontend and backend
components.
* Added more APIs for text-to-speech (TTS) models such as
[`AudioGenerationInputs`](/max/api/python/interfaces#max.interfaces.AudioGenerationInputs)
and
[`AudioGenerationOutput`](/max/api/python/interfaces#max.interfaces.AudioGenerationOutput)
* Changed
[`LoRAConfig.max_num_loras`](/max/api/python/pipelines/lora_config#max.pipelines.lib.lora_config.LoRAConfig.max_num_loras)
default to `1` (was `100`).
* New [`RequestID`](/max/api/python/interfaces/#max.interfaces.RequestID) class
replaces previous type alias to provide better type safety and consistency
across the API.
* Removed `InputContext` and replaced it with the modality-output specific
[`TextGenerationContext`](/max/api/python/interfaces/#max.interfaces.TextGenerationContext)
and
[`EmbeddingsContext`](/max/api/python/interfaces/#max.interfaces.EmbeddingsContext).
* Added
[`ImageMetadata`](/max/api/python/interfaces/#max.interfaces.ImageMetadata) and
[`VLMTextGenerationContext`](/max/api/python/interfaces/#max.interfaces.VLMTextGenerationContext).
* Added [`max.nn.comm`](/max/api/python/nn/comm/) with `Allreduce` and
`Signals` for peer-to-peer communication in allreduce.
* [`ops.gather()`](/max/api/python/graph/ops#max.graph.ops.gather) no longer
has a default `axis`; it must be specified explicitly (better matching PyTorch
and NumPy).
* [`Graph.add_subgraph()`](/max/api/python/graph/Graph#max.graph.Graph.add_subgraph)
has been updated to take a `devices` argument. This allows subgraphs to take
advantage of device-aware work scheduling.
#### Mojo API {#25-7-max-mojo}
* Renamed the `tensor_internal` package to `tensor` and removed the
previous `tensor` stub—the API behaves the same but the [Mojo `tensor`
docs](/mojo/kernels/extensibility/tensor/) moved.
### Mojo language {#25-7-mojo}
For all the updates to the Mojo language, standard library, and tools,
including all GPU programming and `Layout`/`LayoutTensor` changes, see the [Mojo
changelog](/mojo/changelog).
## v25.6.1 (2025-10-10)
Fixes a latency regression due to a top-k algorithm change and a couple
other benchmarking bugs.
## v25.6 (2025-09-22)
* [Highlights](#25-6-highlights)
* [Documentation](#25-6-docs)
* [MAX models](#25-6-models)
* [MAX framework](#25-6-max)
* [Inference server](#25-6-max-serve)
* [`max` CLI](#25-6-max-cli)
* [Python API](#25-6-max-python)
* [MAX kernels](#25-6-kernels)
* [Mojo language](#25-6-mojo)
### Highlights {#25-6-highlights}
* MAX delivers **state-of-the-art performance on NVIDIA Blackwell** (B200)!
We've been describing our Blackwell bring-up over a series of blog posts, and
we recently published [Part 4: Breaking
SOTA](https://www.modular.com/blog/matrix-multiplication-on-blackwell-part-4---breaking-sota),
in which we share our latest matmul benchmarks compared to NVIDIA's cuBLAS
library.
* MAX provides **industry-leading performance on AMD MI355X**!
In a matter of weeks, we got MAX running on the brand-new MI355X system and
have already produced early benchmarks that go head-to-head with Blackwell.
If you have access to an MI355X, you can try it yourself today by following
our [quickstart guide](/max/get-started).
* Benchmarking endpoints is easier than ever with the new [`max
benchmark`](/max/cli/benchmark) command, which accepts YAML
configuration files so you can easily share and reproduce your benchmarks.
### Documentation {#25-6-docs}
* Our new [quickstart guide](/max/get-started) lets you pick the model
architecture and size you want, and then shows you how to deploy it and run our
open-source benchmarking script, all from the `max` CLI.
* We updated and simplified the [benchmarking
tutorial](/max/deploy/benchmark) to use the new `max benchmark`
command.
### MAX models {#25-6-models}
* Added the
[gpt-oss](https://github.com/modular/modular/tree/modular/v25.6.0/max/pipelines/architectures/gpt_oss)
model architecture (GPU, bfloat16).
[Try GPT-OSS now](https://builds.modular.com/models/gpt-oss-20b-BF16/20B).
### MAX framework {#25-6-max}
* Added device-aware work scheduling for AsyncRT: work items can now specify a
`deviceHint` to route execution to specific worker threads based on device
affinity, improving multi-device performance.
* Improved code quality by enabling a large set of Ruff lints, including
[flake8-annotations (ANN)](https://docs.astral.sh/ruff/rules/#flake8-annotations-ann)
which now enforces Python type annotations for new contributions.
#### Inference server {#25-6-max-serve}
* Added support for data parallelism in Llama models. To enable this feature,
use the `--data-parallel-degree` option:
```sh
max serve --model $MODEL_ID --data-parallel-degree 2 --devices gpu:0,1
```
* Metrics for each context encoding and token generation batch are now logged
to the console periodically. You can override the default frequency (3 seconds)
of these logs by setting the `MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S`
environment variable. For example, setting
`MAX_SERVE_SCHEDULER_STATS_LOG_INTERVAL_S=0` logs metrics for every batch.
* Improved error messages when pulling a model that requires more RAM than
what's available or when there won't be enough RAM left for the KV cache.
#### `max` CLI {#25-6-max-cli}
* Added the `max benchmark` subcommand that runs a suite of benchmarks and
collects performance metrics on a model server. This command provides
convenient packaging/installation for our open-source
[`benchmark_serving.py`](https://github.com/modular/modular/tree/main/benchmark#benchmark-max)
script and accepts all the same options.
* Added `--chat-template` to the CLI for passing a custom chat template
defined in a Jinja2 template file.
* Renamed the `--allow-safetensors-weights-float32-to-bfloat16-cast` flag to
`--allow-safetensors-weights-fp32-bf16-bidirectional-cast`, which supports
automatic bidirectional dtype casts when needed.
* The `max generate` command now supports `--top-k`, `--temperature`, and
`--seed` flags.
* Changed `--num-warmups` behavior. Previously, it ran the model on the prompt
`N` times, generating until reaching a stop condition each time. Now it runs
the model for `N` steps, generating `N` new tokens as a warmup.
* Added the `--model` option as a preferred alternative to `--model-path`. They
behave the same.
* Deprecated `--pad-to-multiple-of`.
* Removed the previously deprecated `--model-name`. Use `--served-model-name`
instead.
#### Python API {#25-6-max-python}
* Removed the previously deprecated `KVCacheStrategy.CONTINUOUS` and all
associated classes (including `ContinuousBatchingKVCacheManager`).
* Added [`ops.fence`](/max/api/python/graph/ops#max.graph.ops.fence), a pure
identity operation that prevents the async runtime from reordering operations
across it. This operation is essential for implementing cross-device
synchronization.
* Removed `PipelineConfig.max_new_tokens`. Use
[`SamplingParams.max_new_tokens`](/max/api/python/pipelines#max.pipelines.SamplingParams)
instead.
* Added
[`logits_processor`](/max/api/python/interfaces/#max.interfaces.SamplingParams.logits_processors)
to
[`SamplingParams`](/max/api/python/interfaces/#max.interfaces.SamplingParams)
for updating logits in-place during each step of token generation.
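A logits processor is just a callable that mutates the logits array in place before sampling. As an illustration of the idea only (the exact signature MAX expects is defined in the `SamplingParams` docs; the helper below is hypothetical), a NumPy sketch of a processor that bans one token:

```python
import numpy as np

def ban_token(token_id):
    """Return a processor that masks out one token id in place (hypothetical helper)."""
    def processor(logits):
        # Setting a logit to -inf gives that token zero sampling probability.
        logits[token_id] = -np.inf
    return processor

logits = np.array([1.0, 2.0, 3.0])
ban_token(2)(logits)                 # mutate in place, as logits processors do
probs = np.exp(logits - logits.max())
probs /= probs.sum()
# token 2 now has probability 0
```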
* Added `generate()` to
[`TextGenerationPipeline`](/max/api/python/pipelines/pipeline#max.pipelines.lib.pipeline.TextGenerationPipeline)
and
[`SpeculativeDecodingPipeline`](/max/api/python/pipelines#max.pipelines.SpeculativeDecodingPipeline),
a convenience method for getting text generations. `generate_async()` is
available for getting streamed outputs.
* Renamed the `target_num_new_tokens` configuration parameter to
[`prefill_chunk_size`](/max/api/python/pipelines/config/#max.pipelines.lib.config.PipelineConfig.prefill_chunk_size)
in
[`PipelineConfig`](/max/api/python/pipelines/config/#max.pipelines.lib.config.PipelineConfig)
and `TTSConfig` classes to better reflect its role in chunked prefill
operations.
* Fixed [`ops.range`](/max/api/python/graph/ops#max.graph.ops.range) to respect
the `dtype` parameter when using [`Dim`](/max/api/python/graph/dim) objects as
inputs. Previously, the dtype was ignored and defaulted to `int64`.
* Made the `devices` argument in
[`InferenceSession()`](/max/api/python/engine#max.engine.InferenceSession)
required. To maintain the previous default behavior, use
`InferenceSession(devices=[CPU()])`.
* Added an optional `logging` argument to
[`InferenceSession()`](/max/api/python/engine#max.engine.InferenceSession).
When set to `"op"`, this option enables operation launch output to stderr.
* Added [`max.nn.lora`](/max/api/python/nn/lora), providing
Low-Rank Adaptation (LoRA) support for parameter-efficient fine-tuning of
neural network models.
* Added [`max.nn.moe`](/max/api/python/nn/moe), implementing
Mixture of Experts (MoE) layers for scalable model architectures.
* Added [`max.nn.sampling`](/max/api/python/nn/sampling),
containing advanced sampling methods including MinP and rejection sampling
techniques.
* Added [`max.nn.hooks`](/max/api/python/nn/hooks), providing
debugging and inspection hooks for neural network layers.
* Added attention submodules
[`max.nn.attention.mask_config`](/max/api/python/nn/attention/mask_config),
[`max.nn.attention.multihead_attention`](/max/api/python/nn/attention/multihead_attention),
and
[`max.nn.attention.multi_latent_attention`](/max/api/python/nn/attention/multi_latent_attention)
for comprehensive attention mechanism configuration and implementation.
* Moved some Mojo-related functionality to a new top-level `mojo` Python
namespace. Specifically, `max.mojo` (previously used for Mojo-Python interop),
some of `max.support`, and `max.entrypoints.mojo` now live under the `mojo`
namespace and are provided in the new [`mojo`
package](/mojo/manual/install#whats-included).
### MAX kernels {#25-6-kernels}
* Added a leaky ReLU activation function kernel.
* Added a specialized [RMS norm](/mojo/kernels/nn/normalization/rms_norm/)
function kernel for the common case of `cols=128`, `bfloat16`.
### Mojo language {#25-6-mojo}
For all the updates to the Mojo language, standard library, and tools,
including all GPU programming changes, see the [Mojo
changelog](/mojo/changelog).
## v25.5 (2025-08-05)
* [Highlights](#25-5-highlights)
* [Documentation](#25-5-docs)
* [MAX models](#25-5-models)
* [MAX framework](#25-5-max)
* [Inference server](#25-5-max-serve)
* [`max` CLI](#25-5-max-cli)
* [Python API](#25-5-max-python)
* [Mojo language](#25-5-mojo)
### Highlights {#25-5-highlights}
* **OpenAI-compatible batch API**: The [`/v1/batches`
API](/max/api/serve#operation/createBatch) is now available with
[Mammoth](/mammoth/).
We recently announced a [partnership with SF
Compute](https://www.modular.com/blog/sf-compute) to make this API available
through their dynamic GPU pricing marketplace. Their Large Scale Inference
Batch API looks different from the `/v1/batches` API in Mammoth because it's
a superset.
* **New `mojo` Conda package**: For Mojo-specific projects that run on CPUs and
GPUs, you can now install the bare essentials with the `mojo` Conda package
that's less than 900 MB on disk. For example, this now works:
```sh
pixi add mojo
```
The `mojo` Python package is not available for pip/uv yet.
For a complete model-development and serving toolkit, you should still install
the `modular` package (which includes `mojo` as a dependency).
* **Open-source graph APIs**: We've added the `max.graph` Python APIs to our
[GitHub
repo](https://github.com/modular/modular/tree/modular/v25.5.0/max/graph). We've
made great strides in recent months to simplify these APIs that help you build
high-performance models you can [serve with
MAX](/max/develop/serve-custom-model-architectures).
### Documentation {#25-5-docs}
* New [Serve custom model architectures
tutorial](/max/develop/serve-custom-model-architectures), with [example code
on
GitHub](https://github.com/modular/modular/tree/main/max/examples/custom-models).
* New guide for [using LoRA adapters with MAX](/max/serve/lora-adapters).
* Updated the [Deploy Llama 3 on GPU
tutorial](/max/tutorials/max-serve-local-to-cloud/) with instructions using
AMD MI300X (on Azure).
* Added [Pixi basics](/pixi), which is where we redirect all the now-removed
Magic docs (see our [announcement migrating Magic to
Pixi](https://forum.modular.com/t/migrating-from-magic-to-pixi/1530)).
### MAX models {#25-5-models}
* Added support for
[Idefics3](https://github.com/modular/modular/tree/modular/v25.5.0/max/pipelines/architectures/idefics3)
model.
### MAX framework {#25-5-max}
* Removed all `torch` package dependencies.
* Reduces the total installation size of `modular` (including
dependencies) from 2.2 GB for CPUs and 6.5 GB for GPUs **down to 1.5 GB**, for
all Python packages. Conda packages pull additional system dependencies so
sizes may vary, but one example brings the size down from 9.8 GB to 2.0 GB.
* `pip install` no longer requires the `--extra-index-url
https://download.pytorch.org/whl/cpu` option (which was to avoid installing
the GPU version of `torch` that has a lot of CUDA dependencies).
* `uv pip install` no longer requires the `--index-strategy unsafe-best-match`
option (which was to avoid package resolution issues with the above
`--extra-index-url` option).
* Removed HuggingFace fallback for model pipelines not natively supported in
MAX (`PipelineEngine.HUGGINGFACE`), because it's almost never used and it
creates significant tech debt.
#### Inference server {#25-5-max-serve}
* Added the [`/health` endpoint](/max/api/serve/#operation/health) for service
readiness checks, used by tools like lm-eval to determine when the service is
ready to accept requests.
* [Prefix caching](/max/serve/prefix-caching) now uses an accelerated Mojo
token hashing operation. Previously we used the `hash()` function from the
Python stdlib, which resulted in noticeable CPU overhead and reduced GPU
utilization.
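Conceptually, hash-based prefix caching keys cached KV blocks by a rolling hash over fixed-size token chunks, so a cache lookup only needs cheap hash comparisons. A simplified Python sketch of the idea (the block size and data structures here are illustrative, not the actual MAX implementation):

```python
BLOCK = 4  # illustrative block size

def block_hashes(tokens):
    """Rolling hash per full block: each hash covers the entire prefix so far."""
    hashes, h = [], 0
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h = hash((h, tuple(tokens[i:i + BLOCK])))
        hashes.append(h)
    return hashes

def cached_prefix_len(tokens, cache):
    """Count how many leading tokens are covered by already-cached blocks."""
    n = 0
    for h in block_hashes(tokens):
        if h not in cache:
            break
        n += 1
    return n * BLOCK

cache = set(block_hashes([1, 2, 3, 4, 5, 6, 7, 8]))  # pretend these were served
assert cached_prefix_len([1, 2, 3, 4, 9, 9, 9, 9], cache) == 4
```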
* Re-implemented the OpenAI API's `logprobs` and `echo` request
parameters to eliminate an expensive device transfer.
The `--enable-echo` flag, which previously incurred a significant performance
penalty, is now 9-12x faster.
* Added support for `file://` URIs in image inputs for multimodal models. Local
file access is controlled via the `MAX_SERVE_ALLOWED_IMAGE_ROOTS` environment
variable, which specifies a list of allowed root directories. Files are read
asynchronously using aiofiles for better performance under high load.
* Improved [function calling](/max/serve/function-calling) (tool use) to more
reliably extract JSON tool calling responses for Llama models in an
OpenAI-compatible format.
* Switched from XGrammar to
[llguidance](https://github.com/guidance-ai/llguidance) for generating
structured output (constrained decoding).
#### `max` CLI {#25-5-max-cli}
* Added `--vision-config-overrides` CLI option to override
vision model configuration parameters. For example, to decrease InternVL's
maximum dynamic patches from 12 to 6:
```bash
max serve --model-path OpenGVLab/InternVL3-38B-Instruct \
--vision-config-overrides '{"max_dynamic_patch": 6}'
```
* Removed the `--ignore-eos` CLI argument. The full set of OpenAI chat and
completion sampling parameters is now supported in HTTP requests, so this
parameter can simply be set via the HTTP payload.
#### Python API {#25-5-max-python}
* Added the [`max.interfaces`](/max/api/python/interfaces) module. This module
is intended to be a relatively import-free home for all shared interfaces
across the MAX stack, and we'll gradually move common interfaces into it. So
far, we've moved the following from `max.pipelines.core`:
* Moved `TextGenerationStatus`, `TextResponse`, `TextGenerationResponse`,
`InputContext`, and `PipelineTask` into `max.interfaces`.
* Moved all `TokenGeneratorRequest`-prefixed objects into `max.interfaces`
and renamed with the `TextGenerationRequest` prefix.
* Moved `TextGenerationStatus` to
[`GenerationStatus`](/max/api/python/interfaces/#max.interfaces.GenerationStatus).
* Moved `TextResponse` and `TextGenerationResponse` to
[`TextGenerationOutput`](/max/api/python/interfaces/#max.interfaces.TextGenerationOutput).
* Moved `EmbeddingsResponse` to
[`EmbeddingsOutput`](/max/api/python/interfaces#max.interfaces.EmbeddingsOutput).
* Added [`ops.scatter_nd`](/max/api/python/graph/ops/#max.graph.ops.scatter_nd)
operation for scattering updates into a tensor at specified indices.
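The semantics follow the common `scatter_nd` definition: each index vector selects a position in a copy of the input, and the corresponding update is written there. A NumPy-based sketch of those semantics (illustrative only, not the MAX implementation):

```python
import numpy as np

def scatter_nd(tensor, indices, updates):
    """Write `updates` into a copy of `tensor` at the given index vectors."""
    out = tensor.copy()
    for idx, upd in zip(indices, updates):
        out[tuple(idx)] = upd
    return out

x = np.zeros(5)
result = scatter_nd(x, indices=np.array([[1], [3]]), updates=np.array([9.0, 7.0]))
# result: [0., 9., 0., 7., 0.]
```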
* Added [`ops.avg_pool2d`](/max/api/python/graph/ops/#max.graph.ops.avg_pool2d)
and [`ops.max_pool2d`](/max/api/python/graph/ops/#max.graph.ops.max_pool2d).
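For reference, 2D max pooling slides a window over the input and keeps the maximum in each window. A minimal NumPy sketch of the non-overlapping case (stride equal to kernel size; illustrative only, not the MAX kernel):

```python
import numpy as np

def max_pool2d(x, k):
    """Non-overlapping k x k max pooling over a 2D array (stride = k)."""
    h, w = x.shape
    # Trim to a multiple of k, split into k x k tiles, reduce each tile.
    return x[:h - h % k, :w - w % k].reshape(h // k, k, w // k, k).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
out = max_pool2d(x, 2)
# out: [[5., 7.], [13., 15.]]
```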
* Added [`max.torch.graph_op`](/max/api/python/torch#max.torch.graph_op)
interface to make it simple to embed larger MAX computations and models inside
PyTorch. These can use `max.nn` modules internally and may be used within
`torch.nn` modules, allowing the use of MAX subcomponents for access to our
high-performance graph compiler and Mojo kernel library.
```python
import torch
import numpy as np

import max
from max.dtype import DType
from max.graph import ops

@max.torch.graph_op
def max_grayscale(pic: max.graph.TensorValue):
    scaled = pic.cast(DType.float32) * np.array([0.21, 0.71, 0.07])
    grayscaled = ops.sum(scaled, axis=-1).cast(pic.dtype)
    # max reductions don't remove the dimension, need to squeeze
    return ops.squeeze(grayscaled, axis=-1)

@torch.compile
def grayscale(pic: torch.Tensor):
    output = pic.new_empty(pic.shape[:-1])  # Remove color channel dimension
    max_grayscale(output, pic)  # Call as destination-passing style
    return output

device = "cuda" if torch.cuda.is_available() else "cpu"
img = (torch.rand(64, 64, 3, device=device) * 255).to(torch.uint8)
result = grayscale(img)
```
* Moved `AlgebraicDim`, `Dim`, `StaticDim`, and `SymbolicDim` out of `max.type`
and into [`max.graph.dim`](/max/api/python/graph/dim). You can still import
them directly from `max.graph`.
* Moved `Shape` out of `max.type` and into
[`max.graph.shape`](/max/api/python/graph/shape). You can still import it
directly from `max.graph`.
* Removed the ability to pass Python objects into models and have them returned
as Mojo `PythonObject` types in the kernels.
* Removed `RandomWeights`.
* Removed `Model.execute_legacy()`. Instead use the
standard [`execute()`](/max/api/python/engine#max.engine.Model.execute) or
[`__call__()`](/max/api/python/engine#max.engine.Model.__call__) methods.
* Removed TorchScript-related helper functions and APIs, including support for
`.pt` TorchScript files in custom extensions.
### Mojo language {#25-5-mojo}
For all the updates to the Mojo language, standard library, and tools,
including all GPU programming changes, see the [Mojo
changelog](/mojo/changelog).
## v25.4 (2025-06-18)
* [Highlights](#25-4-highlights)
* [Documentation](#25-4-docs)
* [MAX models](#25-4-models)
* [MAX framework](#25-4-max)
* [Inference server](#25-4-max-serve)
* [`max` CLI](#25-4-max-cli)
* [Python API](#25-4-max-python)
* [Mojo API](#25-4-max-mojo)
* [Custom ops](#25-4-custom-ops)
* [GPU programming](#25-4-gpu-programming)
* [Mojo language](#25-4-mojo)
### ✨ Highlights {#25-4-highlights}
* **AMD GPUs are officially supported!**
You can now deploy MAX with acceleration on AMD MI300X and MI325X GPUs, using
the same code and container that works on NVIDIA GPUs. For the first time,
you can build portable, high-performance GenAI deployments that run
on any platform without vendor lock-in or platform-specific optimizations.
For more details, including benchmarks, see our [Modular + AMD blog
post](https://www.modular.com/blog/modular-x-amd-unleashing-ai-performance-on-amd-gpus).
* **Now accepting GPU kernel contributions**
Last month, we open-sourced the code for the CPU and GPU kernels that power
the MAX framework, and now we're accepting contributions! For information
about how to contribute and the sort of kernels most interesting to us,
see the [MAX AI kernels contributing
guide](https://github.com/modular/modular/blob/main/max/kernels/CONTRIBUTING.md).
* **Preview: Mojo interoperability from Python**
This release includes an early version of a new Python-to-Mojo
interoperability API. You can now write just the performance-critical parts
of your code in Mojo and call it from Python just like you're importing another
Python library. Check out our docs to [call Mojo from
Python](/mojo/manual/python/mojo-from-python).
### Documentation {#25-4-docs}
We've redesigned [builds.modular.com](https://builds.modular.com) and
[docs.modular.com](https://docs.modular.com) with a unified top navigation bar
so you can more easily discover all the available docs and code resources.
New docs:
* [GPU Puzzles](https://builds.modular.com/puzzles/introduction.html): Several
new puzzles, including: 1D convolution op, softmax op, attention op,
embedding op, kernel fusion, custom backward pass, GPU functional programming
patterns, and warp fundamentals.
* [Using AI coding assistants guide](/max/coding-assistants): Learn how to use
large language models (LLMs) and coding assistants (such as Cursor and Claude
Code) to accelerate your development with Modular.
* [Build an MLP block as a graph module tutorial](/max/develop/build-an-mlp-block):
Learn how to create reusable `Module` components in your MAX graphs.
* [Write custom ops for PyTorch
tutorial](/max/develop/custom-kernels-pytorch) (Beta feature): Learn to write
high-performance GPU kernels for your PyTorch models with Mojo.
* [Profile MAX kernel
performance](https://github.com/modular/modular/blob/main/max/docs/kernel-profiling.md):
Learn how to set up Nsight Compute to profile your Mojo-based kernels on NVIDIA
GPUs.
Major updates:
* [Build custom ops for GPUs tutorial](/max/develop/build-custom-ops):
Now includes how to write hardware-specific functions for CPUs and GPUs.
* [Optimize a matrix multiply custom op
tutorial](/max/develop/custom-ops-matmul): Migrated from a Recipe with
revisions to help you improve the performance of your GPU custom ops.
### MAX models {#25-4-models}
* Added the OLMo 2 model architecture
([`olmo2`](https://github.com/modular/modular/tree/modular/v25.4.0/max/pipelines/architectures/olmo2)).
[Try OLMo 2 now](https://builds.modular.com/models/OLMo-2-1124/7B).
* Added Google's Gemma 3 multimodal model architecture
([`gemma3multimodal`](https://github.com/modular/modular/tree/modular/v25.4.0/max/pipelines/architectures/gemma3)).
[Try Gemma3 now](https://builds.modular.com/models/gemma-3-it/1B).
* Added the Qwen 3 model architecture
([`qwen3`](https://github.com/modular/modular/tree/modular/v25.4.0/max/pipelines/architectures/qwen3)).
[Try Qwen3 now](https://builds.modular.com/models/Qwen3/1.7B).
* Added the InternVL3 model architecture
([`internvl`](https://github.com/modular/modular/tree/modular/v25.4.0/max/pipelines/architectures/internvl)).
This is still a work in progress.
* GGUF-quantized Llamas (`q4_0`, `q4_k`, and `q6_k`) are now supported with the
paged KVCache strategy.
### MAX framework {#25-4-max}
#### Inference server {#25-4-max-serve}
* Inflight batching no longer requires chunked prefill.
* Expanded token sampling logic, including `top_k`, `min_p`, `min_new_tokens`,
and `temperature`.
* Extended sampling configuration to be per-request, so different requests can
use different sampling hyperparameters.
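These parameters compose in the standard way: temperature rescales the logits, then top-k restricts sampling to the highest-scoring candidates. A NumPy sketch of that pipeline (illustrative only, not MAX's sampling kernel):

```python
import numpy as np

def sample(logits, top_k, temperature, rng):
    # Temperature rescales the distribution; lower values sharpen it.
    scaled = logits / temperature
    # Top-k: keep only the k highest logits, mask the rest to -inf.
    kth = np.sort(scaled)[-top_k]
    scaled = np.where(scaled >= kth, scaled, -np.inf)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
token = sample(np.array([0.1, 3.0, 2.0, -1.0]), top_k=2, temperature=0.7, rng=rng)
# token is always 1 or 2: the two highest logits
```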
* Removed support for TorchScript and torch MLIR models.
#### `max` CLI {#25-4-max-cli}
* Added the `--use-subgraphs` flag to `max generate` to allow for the use of
subgraphs in the model.
* Added the `--port` option to specify the port number with the `max serve`
command.
#### Python API {#25-4-max-python}
* Lots of new APIs in the [`max.nn`](/max/api/python/nn/) package.
* Added `max.mojo.importer` module to import Mojo code into Python. See the
docs for [calling Mojo from Python](/mojo/manual/python/mojo-from-python).
* Added
[`Graph.add_subgraph()`](/max/api/python/graph/Graph#max.graph.Graph.add_subgraph)
to allow for the addition of a subgraph to a graph.
* Added
[`Module.build_subgraph()`](/max/api/python/nn/module#max.nn.module.Module.build_subgraph)
to allow for the creation of a subgraph for a layer that inherits from
`Module`.
* Added the [`call`](/max/api/python/graph/ops#max.graph.ops.call) op
which allows for the execution of a subgraph.
* Added the [`fold`](/max/api/python/graph/ops#max.graph.ops.fold) op for
combining sliding blocks into a larger tensor.
* Added [`KernelLibrary`](/max/api/python/graph/KernelLibrary) as an argument
type for the [`Graph`](/max/api/python/graph/Graph) constructor.
* Added
[`QuantizationConfig`](/max/api/python/graph/quantization#max.graph.quantization.QuantizationConfig)
to specify quantization parameters for ops such as
[`qmatmul()`](/max/api/python/graph/ops#max.graph.ops.qmatmul).
* Added the `strict` argument to the
[`Module.load_state_dict()`](/max/api/python/nn/module#max.nn.module.Module.load_state_dict)
method. When `strict=True` (default), an error is raised if the `state_dict`
contains unused keys. When `strict=False`, extra keys are ignored. This helps
model developers identify missing implementations in their models.
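The strict check amounts to comparing the incoming keys against the module's own. A hypothetical sketch of the behavior described above (the function and argument names are illustrative, not the MAX API):

```python
def check_state_dict(module_keys, state_dict, strict=True):
    """Raise on keys the module doesn't use when strict=True (hypothetical)."""
    unused = set(state_dict) - set(module_keys)
    if strict and unused:
        raise KeyError(f"state_dict contains unused keys: {sorted(unused)}")
    # With strict=False, extra keys are silently ignored.
    return {k: v for k, v in state_dict.items() if k in module_keys}

weights = {"linear.weight": 1, "linear.bias": 2, "extra.weight": 3}
loaded = check_state_dict(["linear.weight", "linear.bias"], weights, strict=False)
# strict=True would raise KeyError on "extra.weight"
```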
* Added audio generator APIs for text-to-speech models (such as
[`AudioGenerator`](/max/api/python/pipelines/core#max.pipelines.core.AudioGenerator),
[`PipelineAudioTokenizer`](/max/api/python/pipelines/core#max.pipelines.core.PipelineAudioTokenizer),
[`TTSContext`](/max/api/python/pipelines/core#max.pipelines.core.TTSContext),
and others). This is still a work in progress.
* The
[`ops.masked_scatter()`](/max/api/python/graph/ops#max.graph.ops.masked_scatter)
function now requires naming the `out_dim` explicitly as it is data-dependent.
For example:
```python
ops.masked_scatter(
inputs_embeds, video_mask, video_embeds, out_dim="unmasked_inputs"
)
```
* Deprecated the `CONTINUOUS` KVCache strategy
([`KVCacheStrategy`](/max/api/python/nn/kv_cache/cache_params/#max.nn.kv_cache.cache_params.KVCacheStrategy)).
Please use `PAGED` KVCache strategy instead.
* Removed the `Settings` argument from
[`LLM`](/max/api/python/entrypoints#max.entrypoints.llm.LLM) constructor. The
server is now automatically configured in the background without consuming an
HTTP port.
* Removed `Graph.unique_symbolic_dim()`.
* Removed `max_to_torch_type()` and `torch_to_max_type()` and replaced them with
[`DType.to_torch()`](/max/api/python/dtype#max.dtype.DType.to_torch) and
[`DType.from_torch()`](/max/api/python/dtype#max.dtype.DType.from_torch),
respectively. This aligns with the corresponding NumPy methods.
* Removed `stats_report` property and `reset_stats_report` method from
[`InferenceSession`](/max/api/python/engine#max.engine.InferenceSession). This
functionality was primarily used for internal PyTorch debugging and is no
longer needed.
* Removed the naive KVCache (`nn.kv_cache.naive_cache`).
* Removed `nn.attention` and `nn.naive_attention_with_rope`.
* Renamed `ops.select` to
[`ops.where`](/max/api/python/graph/ops#max.graph.ops.where). This matches the
name of the similar operation in torch and numpy.
#### Mojo API {#25-4-max-mojo}
* [`LayoutTensor`](/mojo/kernels/layout/layout_tensor/LayoutTensor/) now has a
`size` method to get the total number of elements.
* Following our [previous deprecation](#25-3-engine-mojo-api) of the Mojo
`max.driver`, `max.graph` and `max.engine` APIs, we've removed them from the
package and API docs.
As a result, we've also removed Mojo `max.tensor` APIs (including
`Tensor`, `TensorShape`, and `TensorSpec`). You can replace any use with
[`LayoutTensor`](/mojo/kernels/layout/layout_tensor/LayoutTensor/).
#### Custom ops {#25-4-custom-ops}
* Improved error messages when custom op parameters are provided with values that
don't have the proper type.
* The [`ops.custom()`](/max/api/python/graph/ops#max.graph.ops.custom) function
now requires a `device` argument to specify where the operation should execute.
This avoids the need for custom ops to infer their execution device, which can
be error-prone.
* Added the [`max.torch`](/max/api/python/torch) module with the
`CustomOpLibrary` class for using custom Mojo kernels from PyTorch. For
example, with a custom `grayscale` operation written in Mojo:
```mojo
@register("grayscale")
struct Grayscale:
@staticmethod
fn execute[
# The kind of device this is running on: "cpu" or "gpu"
target: StaticString,
](
img_out: OutputTensor[dtype = DType.uint8, rank=2],
img_in: InputTensor[dtype = DType.uint8, rank=3],
ctx: DeviceContextPtr,
) raises:
...
```
You can load it with PyTorch like so:
```python
from max.torch import CustomOpLibrary
op_library = CustomOpLibrary("path/to/custom.mojopkg")
@torch.compile(backend=backend)
def grayscale(pic):
result = pic.new_empty(pic.shape[:-1])
op_library.grayscale(result, pic)
return result
img = (torch.rand(64, 64, 3) * 255).to(torch.uint8)
result = grayscale(img)
```
See our [tutorial to write custom ops for
PyTorch](/max/develop/custom-kernels-pytorch), and our [PyTorch custom
operation
examples](https://github.com/modular/modular/tree/main/max/examples/pytorch_custom_ops),
which range from a very basic "hello world" to the replacement of a layer in
a full model.
#### GPU programming {#25-4-gpu-programming}
* Full support for AMD CDNA3 datacenter GPUs is now available! Specifically,
MI300X and MI325X.
* Added initial support for programming on AMD RDNA3 consumer GPUs. Basic
tuning parameters have been specified for AMD Radeon 780m integrated GPUs. (AMD
RDNA3 support is for GPU programming only; AI models are still missing some GPU
kernels for this architecture.) For details, see the [GPU
requirements](/max/packages#gpu-compatibility).
* Now accepting CPU and GPU kernel contributions. See the [MAX AI kernels
contributing
guide](https://github.com/modular/modular/blob/main/max/kernels/CONTRIBUTING.md).
### Mojo language {#25-4-mojo}
For all the updates to the Mojo language, standard library, and tools, see the
[Mojo changelog](/mojo/changelog).
## v25.3 (2025-05-06)
* [Highlights](#25-3-highlights)
* [Documentation](#25-3-docs)
* [`max` CLI](#25-3-max-cli)
* [MAX models](#25-3-models)
* [MAX Serve](#25-3-serve)
* [MAX Engine & Graph](#25-3-engine)
* [Python API](#25-3-engine-mojo-api)
* [Mojo API](#25-3-engine-mojo-api)
* [Custom ops](#25-3-custom-ops)
* [Kernels](#25-3-kernels)
* [GPU programming](#25-3-gpu-programming)
* [Mojo language](#25-3-mojo)
### ✨ Highlights {#25-3-highlights}
* You can now **install Modular APIs and tools with pip**:
```sh
pip install modular \
--index-url https://download.pytorch.org/whl/cpu
```
This installs the `max` CLI, `max` Python library, `mojo` CLI, and Mojo
libraries. However, the Mojo LSP and debugger are currently not included.
We use the `--index-url` argument to ensure that `torch` installs its CPU
dependencies only, thus avoiding a lot of unnecessary GPU packages. This is a
temporary workaround until we can remove our dependency on `torch`.
* We **open-sourced the MAX AI kernels** and the rest of the **Mojo standard
library**!
The [MAX AI kernels library](/mojo/lib#max-ai-kernels-library) is a new Mojo
API for writing high-performance and portable programs across CPU and GPU, but
it's also [the source code for our CPU/GPU
kernels](https://github.com/modular/modular/tree/main/max/kernels/src). You
can now see the Mojo code we use in MAX to power GenAI workloads on CPUs and
GPUs.
Just like the Mojo standard library, these kernels are open source under the
Apache 2.0 License with LLVM exceptions. Plus, the rest of the Mojo standard
library is also [now open source on
GitHub](https://github.com/modular/modular/tree/main/mojo/std/src).
* **Learn to program GPUs** with [Mojo GPU Puzzles](https://builds.modular.com/puzzles)!
This is a brand new site that offers a hands-on guide to mastering GPU
programming with Mojo. Starting from basic concepts, you'll learn
step-by-step how to program for GPUs by solving increasingly challenging
puzzles.
### Documentation {#25-3-docs}
We've restructured the documentation to unify MAX and Mojo documentation
under the Modular Platform. We believe this improves content discovery with a
simplified navigation and helps unify the platform story as a whole.
We've also added the following new docs:
* [REST API reference](/max/api/serve): Although it's not a new API (our
serving library has supported OpenAI APIs for the last few versions), this
now shows precisely which endpoints and body parameters we support.
* [Speculative decoding](/max/serve/speculative-decoding): An introduction to
using speculative decoding to reduce latency for LLMs. This feature is still in
development.
* [Offline inference](/max/serve/offline-inference): An introduction to our
Python API for running inference with an LLM locally (without sending requests
to a serving endpoint).
* [Introduction to layouts](/mojo/manual/layout/layouts): A guide to working
with dense multidimensional arrays on CPUs and GPUs, using new Mojo `layout`
types that abstract away complex memory layout patterns.
### `max` CLI {#25-3-max-cli}
* Renamed the `max-pipelines` CLI tool to `max`. We recommend re-installing
it as shown in the [`max` CLI docs](/max/cli/).
* Removed the previously deprecated `--use-gpu`, `--serialized_model_path`,
`--save_to_serialized_model_path`, `--max_cache_batch_size`, and
`--huggingface-repo-id` options.
* Moved `InputContext`, `TextContext`, and `TextAndVisionContext` from
`max.pipelines` to `max.pipelines.context`.
### MAX models {#25-3-models}
* Added `Llama4ForConditionalGeneration` support,
featuring new MoE layers. Currently, it is limited to text inputs.
Run the model by calling:
```sh
max generate --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct --devices 0,1,2,3
```
* Added support for running text generations using the Mistral 3 24B model.
Run the model with:
```sh
max generate --model-path mistralai/Mistral-Small-3.1-24B-Instruct-2503 --devices 0
```
* Fixed empty textual outputs for certain Mistral models
([MAX issue 4193](https://github.com/modular/modular/issues/4193)).
* Added support for loading a custom pipeline architecture by module. Passing
`--custom-architectures=folder/path/to/import:my_module` loads architectures
from that module, which must expose them via an `ARCHITECTURES` variable. Once
loaded, a model can be run using the new architectures. The flag can be
specified multiple times to load more modules.
### MAX Serve {#25-3-serve}
* Moved from a radix trie to a hash-based prefix caching implementation with
lower CPU overhead. This improves performance, particularly in workloads with
high cache reuse rates.
* Added experimental support for offloading KVCache to host memory via the
`--enable-kvcache-swapping-to-host` and `--host-kvcache-swap-space-gb` flags.
This allows for superior KVCache reuse through prefix caching in workloads
where the reusable KVCache amount exceeds GPU VRAM.
* Fixed the `usage.prompt_tokens` field in the OpenAI API Usage Info response.
Previously, this field was always set to null, but now it correctly
contains the number of prompt tokens in the request.
* Switched from the Python multiprocessing `Queue` to ZeroMQ. This reduces
networking-related latency between the frontend server process and the model
worker process.
* Stray model workers on Linux now terminate more reliably when the parent
process is killed.
### MAX Engine & Graph {#25-3-engine}
#### Python API {#25-3-engine-python-api}
* We now raise an error if there's a mismatch between the expected device of a
weight on a graph and the device of the actual tensor data specified in
[`InferenceSession.load()`](/max/api/python/engine#max.engine.InferenceSession.load).
* Removed `output_device` argument from
[`Model.execute()`](/max/api/python/engine#max.engine.Model.execute).
* Removed the `copy_inputs_to_device` argument in
[`Model.execute`](/max/api/python/engine#max.engine.Model.execute) to improve
predictability of the API. Now `execute()` raises a `TypeError` if arguments
are passed whose devices don't match the model.
* Swapped the order of the `dtype` and `shape` fields of
[`driver.Tensor`](/max/api/python/driver#max.driver.Tensor).
Previously, the arguments were ordered as `(shape, dtype)`; they are now
`(dtype, shape)` to be in line with other tensor-like types.
* Replaced some instances of
[`Tensor.zeros`](/max/api/python/driver#max.driver.Tensor.zeros)
with `Tensor.__init__` when the engine did not depend on the tensor being zero
initialized. This elides the unnecessary memset to provide a minor performance
improvement.
* Added a new experimental
[`Tensor.inplace_copy_from()`](/max/api/python/driver#max.driver.Tensor.inplace_copy_from).
This allows users to copy the contents of one `Tensor` into another.
* Made the default behavior of [`Weight`](/max/api/python/graph/Weight) expect
the initial allocation on the host. A transfer to the target device is then
inserted, and that transferred value is returned when weights generate an MLIR
value. This reflects the current conservative ownership model around external
weights.
* Added the [`irfft`](/max/api/python/graph/ops/#max.graph.ops.irfft) op, which
computes the inverse real fast Fourier transform (FFT).
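For reference, NumPy's `irfft` demonstrates the transform this op computes: it inverts `rfft`, recovering a real signal from its half-spectrum.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
spectrum = np.fft.rfft(x)                     # real FFT: n//2 + 1 complex bins
recovered = np.fft.irfft(spectrum, n=len(x))  # inverse real FFT
assert np.allclose(recovered, x)              # round-trip recovers the signal
```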
* Added the [`argmax`](/max/api/python/graph/ops#max.graph.ops.argmax) op,
which returns the index of the maximum value in an array or sequence.
* Added the [`GroupNorm`](/max/api/python/nn/norm/group_norm) layer.
* Switched layer names so that `max.nn` layers that are implemented with the
deprecated `Layer` class are marked as "V1", and layers that are implemented
with the new [`max.nn.Module`](/max/api/python/nn/module#max.nn.module.Module)
are the default. That is, `max.nn.LinearV2` is now
[`max.nn.Linear`](/max/api/python/nn/Linear), and the
previous `max.nn.Linear` is now
`max.nn.LinearV1`.
* `DeviceRef` values in types and layers are now generally expected to be
explicit rather than implicit.
#### Mojo API {#25-3-engine-mojo-api}
* Removed some functionality from
[`tensor.Tensor`](/mojo/kernels/extensibility/tensor/tensor/Tensor):
* Serializing `Tensor` to disk (`Tensor.tofile(path)` and `Tensor.save(path)`).
* Reading the serialized data back from disk (`Tensor.load(path)` and
`Tensor.fromfile(path)`).
* The `rand` and `randn` methods. Use the equivalents in the Mojo standard
library if you still need to construct a new `Tensor` with random elements
based on a particular `TensorShape`.
* **Deprecated the Mojo Driver, Graph, and Engine APIs**
These APIs are not currently used internally. Instead, we build graphs using
the Python APIs, and our engineering efforts have been focused on making that
experience as robust and user-friendly as possible. As a result, the Mojo
versions of these APIs have not kept pace with new features and language
improvements. These APIs will be open sourced for the community before being
removed.
#### Custom ops API {#25-3-custom-ops}
* You can now pass Mojo source package paths as
[`Graph`](/max/api/python/graph/Graph) custom extensions. The Mojo code will be
compiled automatically; there's no need to run `mojo package` manually as a prior step.
Previously, only pre-compiled `.mojopkg` paths were accepted, requiring the
Mojo code to be built as a prerequisite step before running a `Graph` with a
custom op.
Given a project structure like:
```text
project
|-- main.py
\-- kernels
|-- __init__.mojo
\-- my_custom_op.mojo
```
You can construct a `Graph` in `main.py` that uses the Mojo custom op kernels
simply by passing the source directory:
```python
g = Graph(
...,
custom_extensions = [Path(__file__).parent / "kernels"]
)
```
A change to your Mojo source code defining a custom op will be reflected
immediately the next time the `Graph` is constructed.
* New [image\_pipeline example](https://github.com/modular/modular/tree/main/max/examples/custom_ops)
that demonstrates sequencing custom ops that modify an image, keeping the data
on the GPU between ops before writing it back to the CPU and to disk.
### Kernels {#25-3-kernels}
* More compute overlap is now enabled for Hopper GPUs. This allows finer-grained
scheduling of kernel operations by analyzing producer-consumer patterns within
a compute kernel. As a result, there is more kernel compute overlap, especially
for compute-heavy kernels with data-dependent execution paths.
### GPU programming {#25-3-gpu-programming}
* The CUDA driver requirement has been reduced to version 12.4 and the NVIDIA
driver requirement to version 550. Supporting these earlier versions allows MAX to be more
easily deployed on AWS and GCP, since these are the default versions used by
those cloud providers.
* Added support for programming NVIDIA Jetson Orin GPUs (`sm_87`).
Also see the [Mojo changelog of GPU changes](/mojo/changelog#gpu-changes).
### Mojo language {#25-3-mojo}
* We recently open-sourced the rest of the Mojo standard library, including the
`algorithm`, `benchmark`, `buffer`, `compile`, `complex`, `gpu`, and `layout`
packages. [See it all in
GitHub](https://github.com/modular/modular/tree/main/mojo/std/src).
* We've also open sourced [all our MAX AI
kernels](https://github.com/modular/modular/tree/main/max/kernels/src). This
new library includes `kv_cache`, `layout`, `linalg`, `nn`, `nvml`, and
`quantization`.
For all the updates to the Mojo language, standard library, and tools, see the
[Mojo changelog](/mojo/changelog).
## v25.2 (2025-03-25)
* [Highlights](#25-2-highlights)
* [MAX Serve](#25-2-serve)
* [MAX models](#25-2-models)
* [`max-pipelines` CLI](#25-2-pipelines-cli)
* [MAX Engine](#25-2-engine)
* [Driver APIs](#25-2-driver)
* [Graph APIs](#25-2-graph)
* [Custom ops](#25-2-custom-ops)
* [Hopper Kernels](#25-2-hopper-kernels)
* [GPU programming](#25-2-gpu-programming)
* [Mojo](#25-2-mojo)
* [Documentation](#25-2-documentation)
### ✨ Highlights {#25-2-highlights}
* **Support for NVIDIA Hopper GPUs**
MAX has been optimized to run on Hopper GPUs. For more information on MAX and
NVIDIA's hardware, see the [MAX
container](/max/container#recommended-cloud-instances) documentation.
* **Multi-GPU support**
MAX uses tensor parallelism to distribute work across multiple GPUs so you can
run LLMs like
[`Llama-3.3-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct),
even with long context windows.
* **Expanded library of MAX models**
We're rapidly growing our library of base model architectures that MAX can
accelerate with MAX Serve (including `Phi3ForCausalLM`, `OlmoForCausalLM`,
and `GraniteForCausalLM`). We also now support GPTQ for the Llama models.
For more information, check out our [MAX model
repository](https://builds.modular.com/?category=models).
* **Advanced E2E optimizations for long context windows**
In-flight batching, chunked prefill, and copy-on-write optimize execution for
prefix-heavy and long-context-window scenarios.
* **GPU programming with Mojo**
Lots of new APIs are now available to enable both low-level GPU programming and
abstracted programming patterns that simplify the code required to write GPU
kernels for your AI models.
### MAX Serve {#25-2-serve}
* Extended MAX Serve batch scheduling to account for the prefix cache. The
scheduler can now create larger batches when many prompt tokens are already
cached, improving throughput up to 10% in some benchmarks.
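The scheduling idea can be sketched in plain Python. This toy model (the names and the token-budget heuristic are illustrative, not MAX Serve's actual scheduler) shows why cached prompt tokens let more requests fit in a batch:

```python
def count_schedulable(prompt_lens, cached_lens, token_budget):
    """Toy batch scheduler: only uncached prompt tokens consume the
    prefill token budget, so high cache hit rates allow larger batches."""
    used = scheduled = 0
    for total, cached in zip(prompt_lens, cached_lens):
        cost = total - cached  # tokens that still need prefill compute
        if used + cost > token_budget:
            break
        used += cost
        scheduled += 1
    return scheduled

# With most of two prompts already cached, two requests fit where
# only one fully-uncached prompt would.
print(count_schedulable([100, 100, 100], [80, 0, 90], 120))
```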
* Added support for in-flight batching, allowing token generation requests to be
scheduled alongside context encoding requests to reduce inter-token latency. This
behavior can be controlled by CLI argument `--enable-in-flight-batch`.
* Added support for copy-on-write on KV blocks when using PagedAttention with
Prefix Caching. This improves the prefix cache hit rate and prefill performance
in some scenarios.
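The copy-on-write idea can be sketched in a few lines of Python (a toy page table, not MAX's actual KV cache implementation): forked sequences share pages read-only, and a page is copied only when one of them writes to it:

```python
class PagedCache:
    """Toy copy-on-write KV pages: forked sequences share a page until
    one of them writes, at which point that sequence gets a private copy."""

    def __init__(self):
        self.pages = {}    # page_id -> data
        self.refs = {}     # page_id -> reference count
        self.next_id = 0

    def alloc(self, data):
        pid = self.next_id
        self.next_id += 1
        self.pages[pid] = list(data)
        self.refs[pid] = 1
        return pid

    def share(self, pid):
        self.refs[pid] += 1  # a forked sequence reuses the same page
        return pid

    def write(self, pid, index, value):
        if self.refs[pid] > 1:  # copy on first write to a shared page
            self.refs[pid] -= 1
            pid = self.alloc(self.pages[pid])
        self.pages[pid][index] = value
        return pid
```

Sharing raises the cache hit rate because a fork costs nothing until the sequences actually diverge.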
* MAX Serve now supports `transformers` v4.49.0, with a patch
to avoid graph breaks when using `torch.compile()` on Llama models.
* Added support for recording HTTP traffic out to a file for diagnostics or later
replay.
### MAX models {#25-2-models}
* Added support for executing `LlamaForCausalLM` architecture models on multiple
GPUs. The model uses tensor parallelism automatically when passing multiple
device IDs to the `--devices` CLI argument. Try running
[`meta-llama/Llama-3.3-70B-Instruct`](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
on 4 GPUs with the following example:
```sh
max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
--quantization-encoding bfloat16 \
--devices gpu:0,1,2,3 \
--prompt="Design a
self-sustaining colony on Neptune's moon Triton with a myth/science
fusion name, three quantum tech breakthroughs, one ethical debate, a
neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
```
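Tensor parallelism splits each large weight matrix across devices, with each GPU computing a slice of the output that is then combined. A toy row-parallel matrix-vector product illustrates the idea (this is a sketch of the concept, not MAX's implementation):

```python
def row_parallel_matvec(weight_shards, x):
    """Each 'device' holds a contiguous block of the weight rows and
    computes its slice of the output; concatenating the slices recovers
    the full result, as in tensor-parallel linear layers."""
    result = []
    for shard in weight_shards:  # one shard per device
        result.extend(sum(w * xi for w, xi in zip(row, x)) for row in shard)
    return result

# Full weight [[1, 0], [0, 1], [1, 1]] split across two "devices":
print(row_parallel_matvec([[[1, 0]], [[0, 1], [1, 1]]], [2, 3]))  # [2, 3, 5]
```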
* Added support for the `Phi3ForCausalLM` model architecture (such as
[`microsoft/phi-4`](https://huggingface.co/microsoft/phi-4)). For example:
```sh
max-pipelines generate \
--model-path microsoft/phi-4 \
--prompt "Write bubble sort in mojo"
```
* Added support for the `OlmoForCausalLM` model architecture (such as
[`allenai/OLMo-1B-0724-hf`](https://huggingface.co/allenai/OLMo-1B-0724-hf)). For
example:
```sh
max-pipelines generate \
--model-path allenai/OLMo-1B-0724-hf \
--prompt "Write bubble sort in mojo"
```
* Added support for the `GraniteForCausalLM` model architecture (such as
[`ibm-granite/granite-3.1-8b-instruct`](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)).
For example:
```sh
max-pipelines generate \
--model-path ibm-granite/granite-3.1-8b-instruct \
--prompt "Write bubble sort in mojo"
```
* Added support for:
* [`microsoft/Phi-3.5-mini-instruct`](https://huggingface.co/microsoft/Phi-3.5-mini-instruct)
* [`microsoft/phi-4`](https://huggingface.co/microsoft/phi-4)
* [`LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct`](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
* [`LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct`](https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct)
* We now support GPTQ quantization for models that run on the GPU. This is
handled transparently when the model weights are specified. For example, this
runs Llama 3.1 8B using int4-quantized GPTQ weights:
```sh
max-pipelines generate \
--model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 \
--prompt "Why is the sky blue?" \
--max-batch-size 1 \
--max-length 10000
```
This reduces the total memory consumption of this model from \~16 GB to \~5 GB,
allowing the model to fit in the RAM of smaller GPUs.
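The savings follow directly from per-weight storage: bfloat16 uses 16 bits per parameter while int4 GPTQ uses 4, plus a small overhead for scales and zero points (which is why the real footprint lands nearer ~5 GB than ~4 GB). A rough back-of-the-envelope, assuming roughly 8B parameters:

```python
params = 8_000_000_000        # approximate parameter count for Llama 3.1 8B
bf16_gb = params * 2 / 1e9    # 2 bytes per bfloat16 weight
int4_gb = params * 0.5 / 1e9  # 4 bits per int4 weight (scales/zeros excluded)
print(f"bf16: ~{bf16_gb:.0f} GB, int4: ~{int4_gb:.0f} GB")
```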
* Model weights are now downloaded in parallel.
* Added constraints on whitespace during [Structured
Output](/max/serve/structured-output). This reduces token counts and improves
model adherence.
* Added jump-ahead decoding during Structured Output. This auto-completes tokens
when a single valid path forward is identified, improving completion times by
up to \~20% for long prompts.
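Conceptually, jump-ahead decoding appends tokens without calling the model whenever the output grammar permits exactly one continuation. A toy sketch (the grammar callback here is hypothetical, not the actual API):

```python
def jump_ahead(tokens, allowed_next):
    """Extend the sequence for free while the grammar allows exactly
    one next token; stop when the model must actually choose."""
    tokens = list(tokens)
    while True:
        candidates = allowed_next(tokens)
        if len(candidates) != 1:
            return tokens
        tokens.append(next(iter(candidates)))

# A schema that forces '"name"' then ':' after '{' lets those tokens be
# auto-completed; decoding resumes where multiple values are valid.
grammar = {"{": {'"name"'}, '"name"': {":"}, ":": {'"a"', '"b"'}}
print(jump_ahead(["{"], lambda seq: grammar.get(seq[-1], set())))
```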
* In the event of an unhandled exception, we now use the standard Python
traceback format instead of using pretty-printed Rich tracebacks.
* You must now explicitly import `LLM` from
[`max.entrypoints.llm`](/max/api/python/entrypoints) rather than from
`max.entrypoints` as before.
* The `max.pipelines.dataprocessing.tokenizer` and
`max.pipelines.dataprocessing.gguf_utils` modules have been removed.
* The previously deprecated `PipelineConfig.architecture` field and its
corresponding `--architecture` CLI argument have been removed.
### `max-pipelines` CLI {#25-2-pipelines-cli}
* The `--devices` CLI argument now supports a comma-separated list of GPU IDs
prefixed with `gpu:` like `--devices=gpu:0,1,2,3`. We no longer support the
previous `--devices=gpu-` format.
```sh
max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
--quantization-encoding bfloat16 \
--devices gpu:0,1,2,3 \
--prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
```
* Removed `--huggingface-repo-id`
[PipelineConfig](/max/api/python/pipelines/config/#max.pipelines.config.PipelineConfig)
option and CLI argument in favor of `--model-path`.
* We consolidated `--model-path` and `--weight-path`. Valid `--weight-path` values
now override `--model-path`, which handles both local and remote (Hugging Face)
cases. If we cannot derive the weights from the `--weight-path`, we now fall back
to the `--model-path`, which you must set explicitly.
* Added `--huggingface-revision` option, to allow selecting a non-default branch
or a specific commit in a Hugging Face model repository.
### MAX Engine {#25-2-engine}
* The MAX graph compiler now has kernel caching. This is a significant
improvement to our compilation pipeline. Here are some of the highlights:
* Up to 28% faster compilation times when making iterative changes to models
* Improved caching between different but similar models (up to 27% faster)
* Lays foundation for future caching optimizations
What does this mean for you? Faster development cycles! When you're working on
model pipelines and making changes to the graph, the graph compiler will now
intelligently reuse kernels that haven't changed, significantly reducing
compilation times.
The improvements are particularly noticeable during iterative development, with
compilation times dropping from \~80s to \~57s in some cases of compiling
Llama3.1-8B for 4 GPUs. Even when compiling different models from the same family
(like Llama/Granite variants), you'll see significant speedups on subsequent
compilations.
### Driver APIs {#25-2-driver}
* Added `Accelerator.can_access(other: Device) -> bool` method to check if one
device can directly access memory of another device.
* Fixed a bug in `max.driver.tensor.load_max_tensor()` for `bfloat16` dtype,
which would cause an error about mmap size being too large.
* `max.driver.Tensor.item()` now works on any single-element tensor (previously
restricted to rank-0 tensors).
* Added
[`Device.synchronize()`](/max/api/python/driver#max.driver.Device.synchronize),
which ensures all operations on the device complete before returning.
* Removed `MojoCallContextPtr` in favor of `DeviceContextPtr`.
`MojoCallContextPtr` only contained a `DeviceContextPtr`, so this change
directly exposes the `DeviceContextPtr`. Custom ops using `MojoCallContextPtr`
now directly take a `DeviceContextPtr` argument:
```mojo
@staticmethod
fn execute[
type: DType, rank: Int
](
output: OutputTensor[type=type, rank=rank],
input: InputTensor[type=type, rank=rank],
ctx: MojoCallContextPtr,
):
```
becomes
```mojo
@staticmethod
fn execute[
type: DType, rank: Int
](
output: OutputTensor[type=type, rank=rank],
input: InputTensor[type=type, rank=rank],
ctx: DeviceContextPtr,
):
```
* You can now skip compiling a GPU kernel before enqueueing it, and instead pass
a function directly to `ctx.enqueue_function[func](...)`:
```mojo
fn func():
print("Hello from GPU")
@register("custom_op")
struct CustomOp:
@staticmethod
fn execute(ctx: DeviceContextPtr) raises:
var dev_ctx = ctx.get_device_context()
dev_ctx.enqueue_function[func](grid_dim=1, block_dim=1)
```
However, this convenience incurs an overhead of around 50-500 nanoseconds per
enqueue. If you're launching the same function with the same parameters multiple
times, you can still compile the function once and pass it to `ctx.enqueue_function`:
```mojo
var compiled_func = ctx.compile_function[func]()
# Multiple kernel launches with the same function/parameters
ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
```
* Changed `Accelerator` and `CPU` from factory methods that created `Device`
objects in Python (which were accelerators and CPUs in the C++ implementation) to
actual Python types. This change elevates the `Accelerator` and `CPU` type
concepts to Python, making them types rather than methods.
This allows type annotations in Python. For example, a list of accelerators
used to be defined like this:
```python
graph_devices: list[DeviceRef]
```
Now it can be defined like this:
```python
graph_devices: list[Accelerator]
```
* Elementwise operations (e.g. `__add__`) have been removed from `Tensor`
(that is, `tensor_internal.Tensor`). This `Tensor` type is being phased out;
please migrate usage to `LayoutTensor`.
### Graph APIs {#25-2-graph}
* The `nn` package is now [`max.nn`](/max/api/python/nn/).
* Added [`ops.chunk`](/max/api/python/graph#max.graphs.ops.chunk) to support
chunking tensors along an axis.
* Added support for while loops with [`ops.while_loop`](/max/api/python/graph#max.graphs.ops.while_loop).
* Added support for conditional execution with [`ops.cond`](/max/api/python/graph#max.graph.ops.cond).
* Added axis reduction overloads for
[`ops.min`](/max/api/python/graph/ops#max.graph.ops.min) and
[`ops.max`](/max/api/python/graph/ops#max.graph.ops.max). For example:
`ops.min(tensor, axis=-1)`.
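In plain-Python terms, the axis overload reduces along the given dimension rather than over the whole tensor. For a rank-2 input with `axis=-1`:

```python
def min_last_axis(matrix):
    # Equivalent of a min reduction along axis=-1 for a rank-2 input:
    # one minimum per row instead of a single global minimum.
    return [min(row) for row in matrix]

print(min_last_axis([[3, 1], [5, 4]]))  # [1, 4]
```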
* The [`gelu()`](/max/api/python/graph/ops#max.graph.ops.gelu) function now accepts
an `approximate` keyword. The keyword controls the `gelu` approximation with
`none`, `tanh`, and `fast` approximations accepted.
* Removed the `roundeven()` operation from the Python API. The
[`round()`](/max/api/python/graph/ops#max.graph.ops.round) operation now has the
same behavior as `roundeven()`, so there is no need for both to exist.
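Round-half-to-even ("banker's rounding") is also what Python's built-in `round()` does, so the behavior is easy to check locally:

```python
# Halfway cases round to the nearest even integer, so 0.5 rounds down
# to 0 and 1.5 rounds up to 2 rather than always rounding up.
print([round(x) for x in (0.5, 1.5, 2.5, 3.5)])  # [0, 2, 2, 4]
```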
* Added helpers to create analogous tensors from buffer types and vice versa.
* Added `max.nn.Module`, a base class for writing layers and constructing
networks of layers (e.g. using `max.nn.Sequential`). Currently, this class
supports graph building by ensuring that all weight names are unique and
systematically generated. This class also supports managing the weight values
with the `module.state_dict()` and `module.load_state_dict()` methods. More
functionality and documentation will be added in future releases.
### Custom ops {#25-2-custom-ops}
* Changes have been made to the way that custom ops are registered: rather
than using the `num_dps_outputs` attribute on `@compiler.register` to specify the
number of outputs, that number is now inferred from the signature of the custom
operation. Inputs to the operation now use the `InputTensor` type and outputs
from the operation use `OutputTensor`, instead of the previous
`ManagedTensorSlice` for both. This eliminates the need for a manual
`num_dps_outputs` attribute, and makes it safer to work with these inputs and
outputs by preventing accidental writes to input tensors. The new interface looks
something like the following:
```mojo
@compiler.register("add_one_custom")
struct AddOneCustom:
@staticmethod
fn execute[
target: StringLiteral,
](
out: OutputTensor,
x: InputTensor[type = out.type, rank = out.rank],
ctx: DeviceContextPtr,
) raises:
@parameter
@always_inline
fn elementwise_add_one[
width: Int
](idx: IndexList[x.rank]) -> SIMD[x.type, width]:
return x.load[width](idx) + 1
foreach[elementwise_add_one, target=target](out, ctx)
```
* The `foreach` function now `raises` to be able to handle errors within an
elementwise calculation.
### Hopper kernels {#25-2-hopper-kernels}
State-of-the-Art Kernels in Mojo for H100/H200 GPUs
* **Hopper Architecture Matrix Multiplication Kernels**: The implementation
achieved performance comparable to NVIDIA's highly optimized cuBLAS library.
These kernels take full advantage of the Tensor Cores in Hopper architecture GPUs
to accelerate the fundamental matrix multiplication operations that underpin deep
learning workloads.
* **Multi-GPU AllReduce Implementation**: The AllReduce operation is critical for
distributed inference across multiple GPUs, as it efficiently aggregates
partial results across devices. The Mojo implementation surpassed NVIDIA's NCCL library in performance
benchmarks. This improvement reduces communication overhead during distributed
inference.
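Functionally, AllReduce leaves every device holding the same elementwise reduction of all devices' buffers. A minimal sketch of the semantics (not the optimized communication pattern real kernels use):

```python
def allreduce_sum(device_buffers):
    """Every 'device' ends up with the elementwise sum across all
    devices; real implementations use ring or tree exchanges to
    minimize communication."""
    total = [sum(vals) for vals in zip(*device_buffers)]
    return [list(total) for _ in device_buffers]

print(allreduce_sum([[1, 2], [3, 4]]))  # [[4, 6], [4, 6]]
```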
* **MAX Attention Kernel with Flash Attention 3:** This implementation
incorporates the latest Flash Attention 3 algorithm and extends it, which
significantly accelerates the computation of attention mechanisms in transformer
models. The MAX attention kernel optimizes memory access patterns and
computational steps, reducing both the memory footprint and execution time of
attention operations. This is particularly important for LLMs where attention
calculations represent a substantial portion of the computational workload.
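For reference, the quantity these kernels compute is ordinary scaled dot-product attention; flash-attention-style kernels produce the same result tile-by-tile without materializing the full score matrix. A single-query sketch of the math:

```python
import math

def attention(q, k, v):
    """Single-query scaled dot-product attention:
    softmax(q @ K^T / sqrt(d)) @ V."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in k]
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(w * row[j] for w, row in zip(weights, v)) for j in range(len(v[0]))]

# A zero query weights both keys equally, averaging the values.
print(attention([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]))
```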
### GPU programming {#25-2-gpu-programming}
* Added the Mojo `max.driver` API to enable dispatching
GPU functions from Mojo.
Check out [examples for GPU programming in
Mojo](https://github.com/modular/modular/tree/main/mojo/examples/gpu-functions),
which use this new API.
### Mojo {#25-2-mojo}
Mojo is a crucial component of the MAX stack that enables all of MAX's
performance-oriented code across hardware. For all the updates to the Mojo
language, standard library, and tools, see the [Mojo
changelog](/mojo/changelog).
### Documentation {#25-2-documentation}
New examples for writing custom ops:
* [`fused_attention`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/fused_attention.mojo)
demonstrates complex GPU programming using MAX abstractions for a
practical use in AI model development.
* [`matrix_multiplication`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/matrix_multiplication.mojo)
includes a series of progressive optimizations for matrix multiplications
on GPUs.
* [`histogram`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/histogram.mojo)
shows how to implement the histogram pattern as a custom op.
* New [examples for GPU programming in
Mojo](https://github.com/modular/modular/tree/main/mojo/examples/gpu-functions)
using the new MAX Driver API.
These use a Mojo programming model that should look familiar to CUDA C
programmers, showing how to define and dispatch GPU functions within a
single Mojo file. These examples recreate the first three samples from
the popular textbook ["Programming Massively Parallel
Processors"](https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311),
showing how basic concepts translate from CUDA into Mojo. There's also a
Mandelbrot set calculation example that parallels a similar one in the
existing custom ops examples.
* New [MAX containers](/max/container/) available. For
more information on the base and full MAX containers, see [Container
contents](/max/container/#container-contents).
## v25.1.1 (2025-02-19)
Fixed performance issues in autoregressive models with paged attention
by setting sensible, platform-specific default values for `--max-num-steps`.
## v25.1 (2025-02-13)
* [Highlights](#25-1-highlights)
* [Documentation](#25-1-docs)
* [MAX Serve](#25-1-serve)
* [MAX models](#25-1-max-models)
* [MAX Engine](#25-1-engine)
* [Graph APIs](#25-1-graph)
* [Pipeline APIs](#25-1-pipelines)
* [GPU programming](#25-1-gpus)
* [Mojo](#25-1-mojo)
### ✨ Highlights {#25-1-highlights}
* **Custom ops for GPUs**
Our new custom op API allows you to extend MAX Engine with new graph
operations written in Mojo that execute on either CPU or GPU, providing full
composability and extensibility for your models. See more in the section
about [GPU programming](#25-1-gpus).
* **Enhanced support for agentic workflows**
MAX Serve now supports function calling, which allows you to instruct your
model to interact with other systems, such as retrieve data and execute
external tasks. [Learn more about function calling and tool
use](/max/serve/function-calling).
MAX Serve now supports structured output (also known as constrained decoding)
for MAX models on GPU. This allows you to enforce the output format from a
model using an input schema that defines the output structure. [Learn more about
structured output](/max/serve/structured-output).
* **Extended model architecture support**
* MAX Serve now supports multimodal models that take both text and image
inputs. For example, see [how to deploy Llama 3.2
Vision](/max/tutorials/deploy-llama-vision).
* MAX Serve now supports text embedding models. Learn how to [deploy a text
embedding model](/max/tutorials/run-embeddings-with-max-serve).
* **New `max-pipelines` CLI tool**
Instead of cloning our GitHub repo to access our latest GenAI models, you can
instead install the `max-pipelines` CLI tool and quickly run an inference or
deploy an endpoint.
### Documentation {#25-1-docs}
New tutorials:
* [Build custom ops for GPUs](/max/develop/build-custom-ops)
* [Serverless GPU inference on Google Cloud
Run](/max/tutorials/deploy-serverless-cloud-run)
* [Generate image descriptions with Llama 3.2
Vision](/max/tutorials/deploy-llama-vision)
* [Deploy a text embedding model](/max/tutorials/run-embeddings-with-max-serve)
Other docs:
* [Function calling and tool use](/max/serve/function-calling)
* [Structured output](/max/serve/structured-output)
* [Prefix caching with PagedAttention](/max/serve/prefix-caching)
* `max-pipelines` CLI
### MAX Serve {#25-1-serve}
* The `/v1/completions` REST endpoint now supports:
* Pre-tokenized prompts.
* Image inputs for multimodal models such as `Llama-3.2-11B-Vision-Instruct`.
For an example, see [how to generate image
descriptions with Llama 3.2 Vision](/max/tutorials/deploy-llama-vision).
**Known issue:** You might receive faulty results because some parts of the
text prompt get ignored for certain input combinations. We've identified
the problem and will have a fix in a subsequent nightly
release.
* Function calling and tool use, which allows you to instruct your
model to interact with other systems, such as retrieve data and execute
external tasks. [Learn more about function calling and tool
use](/max/serve/function-calling).
* Structured output (also known as constrained decoding), which allows you to
enforce the output format from a model using a JSON schema and the
`response_format` field. To enable constrained decoding pass
`--enable-structured-output` when running the server. However, this feature
currently works for MAX models on GPU only (support for PyTorch models and
CPU is in progress). [Learn more about structured
output](/max/serve/structured-output).
* Added support for the `/v1/embeddings` API endpoint, allowing you to generate
vector representations using embedding models. See how to [deploy a text
embedding model](/max/tutorials/run-embeddings-with-max-serve).
* MAX Serve can now evict requests when the number of available pages in the
PagedAttention KV cache is limited. Previously, the KV manager would throw an
OOM error when a batch that could not fit in the cache was scheduled.
### MAX models {#25-1-max-models}
* Added the `max-pipelines` CLI tool that simplifies the
process to run inference with GenAI models (specified with a Hugging Face repo
ID) and deploy them to a local endpoint with MAX Serve.
Previously, running or serving these models required cloning the
[modular/max](https://github.com/modular/max) GitHub repo and then running
commands such as `magic run llama3`.
Model-specific commands like `llama3` and `replit` have been
removed. They're now standardized and subsumed by flags like
`--model-path` in the `max-pipelines` tool. Arguments such as
`--max-length` and `--weight-path` are also still supported by
`max-pipelines`.
To view a list of supported model architectures from Hugging Face, run
`max-pipelines list`.
* Added support for PagedAttention, which improves memory efficiency by
partitioning the KV cache into smaller blocks, reducing fragmentation and
enabling larger inference batches. You can enable it with
`--cache-strategy=paged` and `--kv-cache-page-size` with a value that's a
multiple of 128.
* Added support for prefix caching in all cases where PagedAttention is
supported. This allows for more efficient usage of KVCache and improved prefill
performance for workloads with common prefixes. You can enable it by setting
`--enable-prefix-caching`. For more information, see [Prefix caching with
PagedAttention](/max/serve/prefix-caching).
* Batch size and max length are now inferred from available memory and the
Hugging Face model's default max length, respectively. If a configuration leads
to an OOM, then we provide recommendations (to the best of our ability) to the
user to fit the model into memory.
* Added support for heterogeneous KV caches for multi-modal models, such as
Llama Vision, which cache different KV states for self and cross attention
layers.
* Added support for embedding models, starting with MPNet. For example:
```shell
max-pipelines generate \
--model-path=sentence-transformers/all-mpnet-base-v2 \
--prompt="Encode this sentence."
```
Also see [how to deploy a text
embedding model](/max/tutorials/run-embeddings-with-max-serve).
* Added support for image and text multimodal models:
* `max-pipelines generate` now accepts image input with `--image_url`.
* Added an experimental Pixtral pipeline you can run as follows:
```shell
max-pipelines generate \
--model-path=mistral-community/pixtral-12b \
--prompt="What is in this image? [IMG]" \
--image_url=http://picsum.photos/1024/1024
```
The pipeline is automatically used for all models implementing the
`LlavaForConditionalGeneration` architecture.
The implementation currently has a limit of one image. We plan to support an
arbitrary number of images of mixed sizes soon.
* Added an experimental Llama Vision pipeline you can run as follows:
```shell
max-pipelines generate \
--model-path=meta-llama/Llama-3.2-11B-Vision-Instruct \
--prompt="<|image|><|begin_of_text|>What is in this image?" \
--image_url=http://picsum.photos/1024/1024
```
The pipeline is automatically used for all models implementing the
`MllamaForConditionalGeneration` architecture.
Note: This model is gated and requires that you set the
[`HF_TOKEN`](https://huggingface.co/docs/huggingface_hub/en/package_reference/environment_variables#hftoken)
environment variable. See
[Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).
* See [how to generate image
descriptions with Llama 3.2 Vision](/max/tutorials/deploy-llama-vision).
* Added support for the `Qwen2ForCausalLM` model architecture (such as
`Qwen/Qwen2.5-7B-Instruct`). For example:
```shell
max-pipelines generate \
--model-path=Qwen/Qwen2.5-7B-Instruct \
--prompt="Write bubble sort in python" \
--quantization-encoding bfloat16
```
* Added support for offline batched inference for text-based LLMs, allowing you
to load a model and run inference with a batch of inputs directly from Python,
instead of relying on an HTTP interface. For an example, see
[`examples/offline-inference/basic.py`](https://github.com/modular/modular/blob/main/examples/offline-inference/basic.py).
* The `--max-cache-batch-size` flag has been deprecated in favor of
`--max-batch-size`. Using `--max-cache-batch-size` now emits a deprecation
warning and will stop working in a future release.
* The `--use-gpu` flag has been deprecated in favor of `--devices=cpu`,
`--devices=gpu`, or `--devices=gpu-0,gpu-1,...`. If the device isn't specified,
the model runs on the first available GPU, or CPU if no GPUs are available.
### MAX Engine {#25-1-engine}
* Improved internal kernel compilation speed by 1.5x to 4x across different models.
We've revamped our GPU compilation process so that all kernels in a program
are compiled together into a single LLVM module, then split into separate
kernels afterward. This ensures shared code between kernel entry points is
only compiled once. For example, we observe a 3.7x speedup for Llama3.1-8b
GPU startup time.
* Improved initial model execution speed on NVIDIA GPUs.
Instead of compiling to PTX and performing just-in-time compilation during
runtime, we now generate CUBIN binaries directly. While this increases
initial compilation time, it significantly improves execution speed.
* The kernels have been further tuned for performance on NVIDIA A100 GPUs.
#### Graph APIs {#25-1-graph}
* You can now write custom operations (ops) in Mojo, and add them to a graph
constructed in Python, using
[`custom()`](/max/api/python/graph/ops#max.graph.ops.custom) and
[`inplace_custom()`](/max/api/python/max/graph/ops#max.graph.ops.inplace_custom).
For more detail, see the section below about [GPU programming](#25-1-gpus).
* Cached compiled MAX graphs that make use of custom operations now get
invalidated when the implementation of the custom operations change.
* [`Graph.add_weight()`](/max/api/python/graph/Graph#max.graph.Graph.add_weight)
now takes an explicit `device` argument. This enables explicitly passing
GPU-resident weights to
[`session.load()`](/max/api/python/engine#max.engine.InferenceSession.load) via
the weights registry to initialize the model.
* [`max.graph.Weight`](/max/api/python/graph/Weight) now inherits
from `TensorValue`, allowing you to call `weight.cast()` or `weight.T`. As such,
the [`TensorValue`](/max/api/python/graph/TensorValue#max.graph.TensorValue) no
longer accepts `Weight` for the `value` argument.
#### Pipeline APIs {#25-1-pipelines}
* [`TextTokenizer.new_context()`](/max/api/python/pipelines/tokenizer#max.pipelines.tokenizer.TextTokenizer.new_context)
now supports tool definitions passed through its `request` argument (via
`TokenGeneratorRequest.tools`).
It also now supports JSON schemas passed through its `request` argument (via
[`TokenGeneratorRequest.response_format`](/max/api/python/pipelines/interfaces/#max.pipelines.interfaces.TokenGeneratorRequest.response_format)).
* Removed the default `num_steps` value for
[`TokenGenerator.next_token()`](/max/api/python/pipelines/interfaces/#max.pipelines.interfaces.TokenGenerator.next_token),
ensuring users pass a value, reducing the potential for silent errors.
* [`KVCacheStrategy`](/max/api/python/pipelines/kv_cache/cache_params#max.pipelines.kv_cache.cache_params.KVCacheStrategy)
now defaults to `MODEL_DEFAULT`.
Instead of always using the "continuous" caching strategy as before, the KV
caching strategy now defaults on an architecture-specific basis to ensure the
most optimized caching strategy is used.
* The
[`Linear`](/max/api/python/nn/Linear)
layer now has a `create()` class method that automatically creates
specializations of `Linear` for non-quantized, k-quant, or GPTQ layers.
* Added
[`nn.Conv1D`](/max/api/python/nn/conv#max.nn.conv.Conv1D)
for audio models like Whisper.
#### GPU programming {#25-1-gpus}
This release includes all-new APIs for programming GPUs. The way to write code
for GPUs is to create custom operations with GPU functions that you can load
into a MAX graph. This foundational API includes a few key components:
* Mojo APIs to write custom op functions:
* The [`@compiler.register`](/max/api/mojo-decorators/compiler-register)
decorator is applied to a Mojo struct that implements a custom op in an
`execute()` function—for either CPU or GPU—and a `shape()` function that
defines the custom op's output tensor.
* The [`max.tensor`](/mojo/kernels/extensibility/tensor/) package adds
essential Mojo APIs for writing custom ops, such as:
* The [`foreach()`](/mojo/kernels/extensibility/tensor/managed_tensor_slice/foreach)
function, which efficiently executes an element-wise computation in parallel
on either a GPU or CPU.
* The
[`ManagedTensorSlice`](/mojo/kernels/extensibility/tensor/managed_tensor_slice/ManagedTensorSlice)
type defines the input and output tensors for the custom op.
* Python APIs to load custom ops into a model:
* The [`custom()`](/max/api/python/graph/ops#max.graph.ops.custom) and
`inplace_custom()`
functions allow you to add the previously-defined Mojo custom op to a MAX
graph written in Python.
* The [`InferenceSession`](/max/api/python/engine#max.engine.InferenceSession)
constructor accepts the custom op implementation as a [Mojo
package](/mojo/manual/packages#mojo-packages) in the `custom_extensions`
argument.
For more detail, see the [tutorial to build custom ops for
GPUs](/max/develop/build-custom-ops), or check out this [simple example of
a custom
op](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/add_custom.mojo).
Additionally, we've added a new [`gpu` package](/mojo/std/gpu/) to the Mojo
standard library that provides low-level programming constructs for working
with GPUs. These APIs let you do things that you can't currently do with the
high-level `foreach()` abstraction above. The Mojo `gpu` APIs allow you to
manually manage interaction between the CPU host and GPU device, manage memory
between devices, synchronize threads, and more. For some examples, see
[`vector_addition.mojo`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/vector_addition.mojo)
and
[`top_k.mojo`](https://github.com/modular/modular/blob/main/examples/custom_ops/kernels/top_k.mojo).
### Mojo {#25-1-mojo}
Mojo is a crucial component of the MAX stack that enables all of MAX's
performance-oriented code across hardware. For all the updates to the Mojo
language, standard library, and tools, see the [Mojo
changelog](/mojo/changelog).
## v24.6 (2024-12-17)
This is a huge update that offers a first look at our serving library for
MAX on GPUs!
* [Highlights](#24-6-highlights)
* [Documentation](#24-6-docs)
* [MAX Serve](#24-6-serve)
* [MAX models](#24-6-models)
* [MAX Engine](#24-6-engine)
* [Driver APIs](#24-6-driver-api)
* [Graph compiler](#24-6-graph-compiler)
* [Graph APIs](#24-6-graph-api)
* [Custom op registration](#24-6-custom-ops)
* [Numeric kernels](#24-6-kernels)
* [Mojo](#24-6-mojo)
Also check out our [blog post introducing MAX
24.6](https://www.modular.com/blog/introducing-max-24-6-a-gpu-native-generative-ai-platform).
### ✨ Highlights {#24-6-highlights}
* **MAX Engine on GPUs preview**
We're excited to share a preview of MAX Engine on GPUs. We've created a few
tutorials that demonstrate MAX's ability to run GenAI models with our
next-generation MAX graph compiler on NVIDIA GPU architectures (including
A100, A10, L4, and L40 GPUs). You can experience it today by [deploying
Llama 3 on an A100 GPU](/max/tutorials/max-serve-local-to-cloud).
* **MAX Serve preview**
This release also includes an all-new serving interface called MAX
Serve. It's a Python-based serving layer that supports both
native MAX models when you want a high-performance deployment, and
off-the-shelf PyTorch LLMs from Hugging Face when you want to explore and
experiment—all with GPU support. It provides an OpenAI-compatible REST
endpoint for inference requests, and a Prometheus-compatible metrics
endpoint. You can use a `magic` command to start a local server, or use our
ready-to-deploy MAX container to start an endpoint in the cloud. Try it now
[with an LLM from Hugging Face](/max/tutorials/max-serve-local-to-cloud).
* **Upgraded MAX models**
As we continue to build our Python-based MAX Graph API that allows you to
build high-performance GenAI models, we've made a ton of performance
improvements to the existing models and added a few new models to our GitHub
repo. All the Python-based MAX models now support GPUs and broad model
architectures. For example,
[`llama3`](https://github.com/modular/modular/tree/main/max/pipelines/architectures/llama3)
adds compatibility for the LlamaForCausalLM family, which includes over
20,000 model variants and weights on Hugging Face.
### Documentation {#24-6-docs}
New tutorials:
* [Deploy Llama 3 on GPU with MAX
Serve](/max/tutorials/max-serve-local-to-cloud)
* [Deploy Llama 3.1 on GPU-powered Kubernetes
clusters](/max/tutorials/deploy-max-serve-on-kubernetes)
* [Get started with MAX Graph in
Python](/max/tutorials/get-started-with-max-graph-in-python)
Other new docs:
* [MAX container](/max/container)
* [Benchmark MAX
Serve](https://github.com/modular/modular/tree/main/benchmark)
Also, our documentation is now available for **MAX nightly builds**! If you're
building with a nightly
release, you can
switch to see the nightly docs using a toggle to the right of the search bar.
### MAX Serve {#24-6-serve}
This release includes a preview of our Python-based serving library called MAX
Serve. It simplifies the process to deploy your own inference
server with consistent and reliable performance.
MAX Serve currently includes the following features:
* Deploys locally and to the cloud with our [MAX container
image](/max/container), or with the `magic` CLI.
* An OpenAI-compatible server with streaming `/chat/completions` and
`/completions` endpoints for LLM inference requests.
* Prometheus-compatible [metrics endpoint](/max/container#metrics) with LLM
KPIs (TTFT and ITL) for monitoring and evaluating performance.
* Supports most `TextGeneration` Hugging Face Hub models.
* Multiprocess HTTP/model worker architecture that maximizes CPU core
utilization by distributing incoming requests across multiple processes,
ensuring both high throughput and responsiveness.
* Continuous heterogeneous batching to combine multiple incoming requests into
a single inference (no waiting to fill a batch size) and improve total
throughput.
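The continuous-batching idea described above can be sketched in a few lines of plain Python. This is a toy scheduler to illustrate the concept, not the MAX Serve implementation: new requests join the in-flight batch at every step rather than waiting for a fresh batch to fill.

```python
from collections import deque

# Toy sketch of continuous batching: at each decode step, admit pending
# requests into the active batch (up to max_batch) instead of waiting
# for a full batch before starting.
def run(arrival_schedule, max_batch=4, steps_per_req=2):
    pending, active, log = deque(), {}, []
    for new_requests in arrival_schedule:
        pending.extend(new_requests)
        while pending and len(active) < max_batch:
            active[pending.popleft()] = steps_per_req
        log.append(sorted(active))         # who is batched this step
        for req in list(active):           # one decode step per request
            active[req] -= 1
            if active[req] == 0:
                del active[req]
    return log

# "c" arrives while "a" and "b" are mid-flight and is batched immediately.
log = run([["a", "b"], ["c"], []])
```

With a waiting-to-fill scheduler, "c" would idle until "a" and "b" finished; here it rides along on the very next step.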
There's much more still in the works for MAX Serve, but you can try it today
with our tutorials to [Deploy Llama 3 on GPU with MAX
Serve](/max/tutorials/max-serve-local-to-cloud).
**Known issues:**
* While this release is enough to support typical chatbot applications,
it does not yet support the function-calling portion of the
OpenAI API specification needed to enable robust agentic workflows.
* Sampling is still limited and doesn't currently respect temperature or
other sampling-related API request inputs.
* Structured generation is not supported.
* Support for multi-modal models is still nascent.
### MAX models {#24-6-models}
All of our Python-based GenAI [models on
GitHub](https://github.com/modular/modular/tree/main/max/pipelines/architectures)
now support GPUs!
As we add more models, we're also building a robust set of libraries and
infrastructure that make it easier to build and deploy a growing library of
LLMs. Some of this is available in the new
[`max.pipelines`](/max/api/python/pipelines/) package, and some of it lives
alongside the [models on
GitHub](https://github.com/modular/modular/tree/main/max/pipelines/architectures).
Here are just some of the highlights:
* Deep integration with the Hugging Face ecosystem for a quick-to-deploy
experience, such as using HF Model Hub tools to fetch config files, support for
weights in [safetensor](https://github.com/huggingface/safetensors) format,
support for HF tokenizers, and more. (We also support GGUF weight formats.)
* Expanded set of model abstractions for use by different LLM architectures:
* Attention layers (including highly optimized implementations with
configurable masking, like
[`AttentionWithRope`](https://github.com/modular/modular/tree/main/max/nn/attention/attention_with_rope.py)).
The optimized attention layers include variants that accept an attention
mask. More memory-efficient variants that don't take a mask instead take a
"mask functor" argument to the kernel, which implements masking without
materializing a mask by computing a mask value from input coordinates on the
fly.
* Transformers such as [`Transformer` and
`TransformerBlock`](https://github.com/modular/modular/tree/main/max/nn/transformer/transformer.py).
These include an initial implementation of ragged tensors—tensors for which
each dimension can have a different size, avoiding the use of padding tokens
by flattening a batch of sequences of differing lengths.
* Common layers such as
[`RMSNorm`](https://github.com/modular/modular/tree/main/max/nn/norm/rms_norm.py),
[`Embedding`](https://github.com/modular/modular/tree/main/max/nn/embedding.py),
and
[`Sequential`](https://github.com/modular/modular/tree/main/max/nn/sequential.py).
* KV cache management helpers, like
[`ContinuousBatchingKVCacheManager`](/max/api/python/pipelines/kv_cache/continuous_batching_cache#max.pipelines.kv_cache.continuous_batching_cache.ContinuousBatchingKVCacheManager).
* Low-level wrappers over optimized kernels like
[`fused_qk_ragged_rope`](https://github.com/modular/modular/tree/main/max/nn/kernels.py).
These are custom fused kernels that update the KV cache in place. Although
they are custom, they reuse the underlying kernel implementation by passing
in lambda functions used to retrieve inputs and write to outputs in place.
* Added generalized interfaces for text generation such as
[`TokenGenerator`](/max/api/python/pipelines/interfaces#max.pipelines.interfaces.TokenGenerator)
and
[`PipelineModel`](/max/api/python/pipelines/pipeline#max.pipelines.pipeline.PipelineModel),
which provide modularity within the models and serving infrastructure. Also
added a plug-in mechanism
([`PipelineRegistry`](/max/api/python/pipelines/registry#max.pipelines.registry.PipelineRegistry))
to more quickly define new models, tokenizers, and other reusable components.
For example, anything that conforms to
[`TokenGenerator`](/max/api/python/pipelines/interfaces#max.pipelines.interfaces.TokenGenerator)
can be served using the LLM infrastructure within MAX Serve. We then used this
interface to create the following:
* An optimized
[`TextGenerationPipeline`](/max/api/python/pipelines/pipeline#max.pipelines.pipeline.TextGenerationPipeline)
that can be combined with any compatible graph and has powerful performance
features like graph-based multi-step scheduling, sampling, KV cache
management, ragged tensor support, and more.
* A generic
[`HFTextGenerationPipeline`](/max/api/python/pipelines/hf_pipeline#max.pipelines.hf_pipeline.HFTextGenerationPipeline)
that can run any Hugging Face model for which we don't yet have an optimized
implementation in eager mode.
* Models now accept weights via a weights registry, which is passed to the
[`session.load()`](/max/api/python/engine#max.engine.InferenceSession.load)
method's `weights_registry` argument. The decoupling of weights and model
architecture allows implementing all of the different fine-tunes for a given
model with the same graph. Furthermore, because the underlying design is
decoupled, we can later expose the ability to compile a model once and swap
weights out on the fly, without re-compiling the model.
* Added generic implementations of common kernels, which allow you to plug in
different batching strategies (ragged or padded), KV cache management
approaches (continuous batching), masking (causal, sliding window, etc.), and
position encoding (RoPE or ALIBI) without having to re-write any kernel code.
(More about this in a future release.)
* Multi-step scheduling to run multiple token-generation steps on GPU before
synchronizing to the CPU.
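Among the abstractions above, the "mask functor" idea can be illustrated with a small NumPy sketch. This is a hedged toy, not the MAX kernel interface: mask values are computed from query/key coordinates on the fly instead of materializing a full mask tensor.

```python
import numpy as np

# Toy mask functor: given (query index, key index), return the additive
# mask value. A causal mask forbids attending to future positions.
def causal_mask_value(q_idx: int, k_idx: int) -> float:
    return 0.0 if k_idx <= q_idx else -np.inf

seq = 4
scores = np.zeros((seq, seq))  # stand-in attention scores
masked = np.array([[scores[q, k] + causal_mask_value(q, k)
                    for k in range(seq)] for q in range(seq)])
# Future positions (k > q) become -inf, so softmax assigns them zero weight,
# and no [seq, seq] mask tensor ever needs to be allocated by the kernel.
```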
**Updated models:**
* Significant performance upgrades for [Llama
3](https://github.com/modular/modular/tree/main/max/pipelines/architectures/llama3),
and expanded compatibility with the `LlamaForCausalLM` models family. For
example, it also supports Llama 3.2 1B and 3B text models.
**New models:**
* [Mistral
NeMo](https://github.com/modular/modular/tree/main/max/pipelines/architectures/mistral)
(and other `MistralForCausalLM` models)
* [Replit Code V1.5
3B](https://github.com/modular/modular/tree/main/max/pipelines/architectures/replit)
**Known issues:**
* The Q4 quantized models currently work on CPU only.
* Using a large setting for `top-k` with the Llama 3.1 model may lead to
segmentation faults for certain workloads when run on NVIDIA GPUs. This should
be resolved in the latest nightly MAX builds.
* The models currently use a smaller default context window than the
`max_seq_len` specified in the Hugging Face configuration files for a given
model. This can be manually adjusted by setting the `--max-length` parameter to
the desired context length when serving a model.
* Some variants of the supported core models (like `LlamaForCausalLM` with
different numbers of heads, head sizes, and so on) might not be fully optimized yet.
We plan to fully generalize our implementations in a future release.
### MAX Engine {#24-6-engine}
MAX Engine includes much of the core infrastructure that enables MAX to
accelerate AI models on any hardware, such as the graph compiler, runtime,
kernels, and the APIs to interact with them all. It all works without external
dependencies such as PyTorch or CUDA.
This release includes a bunch of performance upgrades to our graph compiler and
runtime. We've added support for NVIDIA GPU architectures (including A100, A10,
L4, and L40 GPUs), and built out new infrastructure so we can quickly add
support for other GPU hardware.
**Engine API changes:**
* [`InferenceSession`](/max/api/python/engine#max.engine.InferenceSession)
now accepts a `custom_extensions` constructor argument, same as `load()`, to
specify model extension libraries.
* The [`Model`](/max/api/python/engine#max.engine.Model) object is now callable
to run an inference.
**Breaking changes**:
* `Model.execute()` signature changed to support GPUs.
* The [`execute()`](/max/api/python/engine#max.engine.Model.execute) function
currently doesn't accept keyword arguments. Instead you can pass tensors as a
[`driver.Tensor`](/max/api/python/driver#max.driver.Tensor), `int`, `float`,
`bool`,
[`np.generic`](https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.generic),
or [`DLPackArray`](/max/api/python/driver#max.driver.DLPackArray)
([DLPack](https://github.com/dmlc/dlpack)). Note that both PyTorch and NumPy
arrays implement the DLPack protocol, which means you can also pass either of
those types to `execute()`.
* [`execute_legacy()`](/max/api/python/engine#max.engine.Model.execute_legacy)
preserves the semantics of `execute()` with support for keyword arguments to
help with migration, but will be removed in a future release.
`execute_legacy()` doesn't support GPUs.
* Calling `execute()` with positional arguments still works the same.
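Because NumPy arrays implement the DLPack protocol, the zero-copy hand-off mentioned above can be sanity-checked without any MAX APIs at all:

```python
import numpy as np

# NumPy arrays expose the DLPack protocol, which is why they can be
# passed directly as model inputs: the consumer imports the buffer
# zero-copy instead of copying it.
a = np.arange(6, dtype=np.float32)
b = np.from_dlpack(a)  # imports via a.__dlpack__(); no data copy
same_buffer = np.shares_memory(a, b)  # True: both views share one buffer
```

PyTorch tensors work the same way through `torch.from_dlpack()` and `torch.Tensor.__dlpack__()`.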
#### Driver APIs {#24-6-driver-api}
MAX Driver (the [`max.driver`](/max/api/python/driver) module) is a new
component of MAX Engine that's still a work in progress. It provides primitives
for working with heterogeneous hardware systems (GPUs and CPUs), such as to
allocate on-device memory, transfer data between host and device, query device
stats, and more. It's a foundation on which other components of MAX Engine
operate (for example, `InferenceEngine` now uses
[`driver.Tensor`](/max/api/python/driver#max.driver.Tensor) to handle model
inputs and outputs).
**Driver API changes:**
* Added `CUDA()` device to open an NVIDIA GPU.
* Added support for fp16 and bfloat16 dtypes.
* Expanded functionality for `max.driver.Device`, with new class methods and
properties. We are still working on building this out to support more
accelerator features.
* [`driver.Tensor`](/max/api/python/driver#max.driver.Tensor) (and the
`InferenceSession.load()` argument `weights_registry`) now supports zero-copy
interoperability with NumPy arrays and PyTorch tensors, using
[DLPack](https://github.com/dmlc/dlpack) /
[`DLPackArray`](/max/api/python/driver#max.driver.DLPackArray).
* [`driver.Tensor`](/max/api/python/driver#max.driver.Tensor) has new methods,
such as `from_dlpack()`, `element_size()`, `to()`, `to_numpy()`, `view()`,
`zeros()`, and more.
MAX Driver APIs are still changing rapidly and not yet ready for general use.
We'll publish more documentation in a future release.
**Known issues:**
* MAX Driver is currently limited to managing just one NVIDIA GPU at a time (it
does not yet support multi-GPU). It also does not yet support remote devices.
* DLPack support is not complete. For example, streams are not yet supported.
#### Graph compiler {#24-6-graph-compiler}
When you load a model into MAX Engine, the graph compiler is the component that
inspects and optimizes all graph operations (ops) to deliver the best run time
performance on each device.
This release includes various graph compiler improvements:
* Major extensions to support NVIDIA GPUs (and other devices in the future),
including async copies and caching of JIT'd kernels.
* The runtime now performs scheduling to enable GPU compute overlap with the
CPU.
* New transformations to the Mojo kernels to enable a number of optimizations,
including specialization on tensor dimensions, specialization on target
hardware, specialization on non-tensor dimension input to kernels, automatic
kernel fusion between operators, and more.
* New algebraic simplifications and algorithms for ops such as horizontal
fusion of matrix multiplications.
* New CPU-side primitives for device management that are automatically
transformed and optimized to reduce overhead (MAX does not need to use things
like CUDA Graphs).
* Updated memory planning to preallocate device memory (hoist computation from
inference runtime to initialization time) and reduce per-inference overhead.
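The horizontal fusion of matrix multiplications mentioned above can be sketched with NumPy. This is a hedged illustration of the transformation, not the compiler's implementation: two matmuls sharing a left-hand side become one larger matmul over a concatenated right-hand side.

```python
import numpy as np

# Horizontal matmul fusion: x @ w1 and x @ w2 share the same LHS, so
# they can run as a single matmul against concat(w1, w2), then the
# fused result is split back into the two original outputs.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))
w1 = rng.standard_normal((8, 4))
w2 = rng.standard_normal((8, 5))
fused = x @ np.concatenate([w1, w2], axis=1)  # one larger kernel launch
y1, y2 = fused[:, :4], fused[:, 4:]           # recover both results
```

The payoff is one kernel launch and one read of `x` instead of two.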
#### Graph APIs {#24-6-graph-api}
The graph compiler is also exposed through the MAX Graph APIs (the
[`max.graph`](/max/api/python/graph/) package), which allow you to build
high-performance GenAI models in Python.
**Graph API changes:**
* Python stack traces from model execution failures now include a trace to the
original op-creation, allowing for easier debugging during development.
* The [`max.graph`](/max/api/python/graph/) APIs now include preliminary
support for symbolic algebraic expressions using
[`AlgebraicDim`](/max/api/python/graph/type#max.graph.type.AlgebraicDim),
enabling more powerful support for checked dynamic shapes. This allows
expressions like `-Dim("x") - 4`. Furthermore, algebraic expressions simplify to
a canonical form, so that, for example, `-Dim("x") - 4 == -(Dim("x") + 4)` holds.
* More advanced dtype promotion now allows
[`TensorValue`](/max/api/python/graph/TensorValue) math operators to just work
when used with NumPy arrays and Python primitives.
* [`TensorValue`](/max/api/python/graph/TensorValue) has new methods, such as
`broadcast_to()`, `cast()`, `flatten()`, `permute()`, and more.
* Added [`BufferValue`](/max/api/python/graph/BufferValue), which allows for
device-resident tensors that are read and mutated within the graph.
* [`DType`](/max/api/python/dtype#max.dtype.DType) has new methods/properties,
`align`, `size_in_bytes`, and `is_float()`.
* [`Value`](/max/api/python/graph/Value) constructor accepts more types for
`value`.
* [`TensorValue`](/max/api/python/graph/TensorValue) constructor accepts more
types for `value`.
* [`TensorValue.rebind()`](/max/api/python/graph/TensorValue#max.graph.TensorValue.rebind)
accepts a new `message` argument.
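The canonical-form behavior of algebraic dimensions described above can be sketched with a toy affine representation. The `AffineDim` class below is hypothetical, not MAX's `AlgebraicDim`: it shows how reducing expressions to `coeff * name + offset` makes algebraically equal expressions compare structurally equal.

```python
from dataclasses import dataclass

# Toy canonical form for a symbolic dim: coeff * name + offset.
# Because every expression reduces to this shape, -(x + 4) and -x - 4
# end up as the same value and compare equal.
@dataclass(frozen=True)
class AffineDim:
    name: str
    coeff: int = 1
    offset: int = 0

    def __neg__(self) -> "AffineDim":
        return AffineDim(self.name, -self.coeff, -self.offset)

    def __add__(self, k: int) -> "AffineDim":
        return AffineDim(self.name, self.coeff, self.offset + k)

    def __sub__(self, k: int) -> "AffineDim":
        return self + (-k)

x = AffineDim("x")
# -x - 4 and -(x + 4) reduce to the same canonical form (coeff=-1, offset=-4).
```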
**Breaking changes:**
* [`Graph.add_weight()`](/max/api/python/graph/Graph#max.graph.Graph.add_weight)
now accepts [`Weight`](/max/api/python/graph/Weight#max.graph.Weight) and
returns [`TensorValue`](/max/api/python/graph/TensorValue).
[`Weight`](/max/api/python/graph/Weight#max.graph.Weight) is essentially a
named placeholder for a tensor that knows its name, dtype, shape, and
optionally device and quantization encoding. `Graph.add_weight()` stages an op
in the graph that is populated by a named weight in the weights registry passed
to `session.load`.
* The [`Weight`](/max/api/python/graph/Weight#max.graph.Weight) constructor
arguments changed; added `align`, `dtype`, and `shape`; removed `assign`,
`filepath`, `offset`, and `value`.
* The `ops.scalar()` method was removed along with the `is_static()` and
`is_symbolic()` methods from all `graph.type` objects.
* Instead of `ops.scalar()`, use
[`ops.constant()`](/max/api/python/graph/ops#max.graph.ops.constant).
* Instead of `is_static()` and `is_symbolic()`, use
`isinstance(dim, SymbolicDim)` and `isinstance(dim, StaticDim)`.
The MAX Graph APIs are not yet ready for general use, but you can [experiment
with them now by following this
tutorial](/max/tutorials/get-started-with-max-graph-in-python). We'll add more
documentation when we finish some API redesigns.
#### Custom op registration {#24-6-custom-ops}
Although the APIs to write custom operators (ops) aren't ready for general use,
this release includes a significant redesign that lays the groundwork. You
might notice some associated APIs in this release and more APIs in the
nightlies, so here's a little about the work in progress:
* The custom op APIs will allow you to extend MAX Engine with new ops written
in Mojo, providing full composability and extensibility for your models. It's
the exact same API we use to write MAX Engine's built-in ops such as `matmul`.
That means your custom ops can benefit from all our compiler optimization
features such as kernel fusion—your ops are treated the same as all the ops
included "in the box."
* The new API requires far less adornment at the definition site to enable the
MAX model compiler to optimize custom ops along with the rest of the graph
(compared to our previous version that used `NDBuffer`).
* Custom ops support "destination passing style" for tensors.
* The design composes on top of Mojo's powerful metaprogramming, as well as
the kernel library's abstractions for composable kernels.
We'll publish more documentation when the custom op API is ready for general
use. Check out the MAX repo's `nightly` branch to see the latest [custom op
examples](https://github.com/modular/modular/tree/main/max/examples/custom_ops).
**Known issues:**
* Custom ops don't have type or lifetime checking. They also don't reason about
mutability. Expect lots of sharp corners and segfaults if you hold them wrong
while we improve this!
#### Numeric kernels {#24-6-kernels}
The GPU kernels for MAX Engine are built from the ground up in Mojo with no
dependencies on external vendor code or libraries. This release includes the
following kernel improvements:
* AttenGen: a novel way to express attention patterns that supports different
attention masks, score functions, and caching strategies.
* State-of-the-art matrix multiplication algorithms with optimizations such as
the following:
* Pipelining and double-buffering to overlap data transfer and computation
and to hide memory access latency (for both global and shared memory).
* Thread swizzling to avoid shared memory bank conflicts associated with
tensor core layouts.
* Block swizzling to increase L2 cache locality.
* SplitK/StreamK GEMM algorithms: divide the computation along the shared K
dimension into smaller matrices that can then be executed independently on
streaming multiprocessors (SMs). These algorithms are ideal for
matrices with a large K dimension but a small M dimension.
* Large context length MHA: uses SplitK/StreamK to implement the attention
mechanism and eliminate the need for a huge score matrix, which drastically
reduces memory usage and traffic to enable large context lengths.
* DualGemm: accelerates the multi-layer perceptron (MLP) layers where the
left-hand side (LHS) is shared between two matrix multiplications.
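The SplitK decomposition above can be sketched with NumPy. This is a hedged illustration of the math, not the GPU kernel: the shared K dimension is partitioned into chunks whose partial products are computed independently and then reduced.

```python
import numpy as np

# SplitK sketch: split the shared K dimension into chunks, compute each
# partial matmul independently (separate SM work items on a GPU), then
# reduce the partial results. Useful when K is large but M is small.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 64))   # small M, large K
B = rng.standard_normal((64, 4))
k_chunks = np.array_split(np.arange(64), 4)
partials = [A[:, s] @ B[s, :] for s in k_chunks]  # independent work items
C = np.sum(partials, axis=0)       # final reduction across the splits
```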
**Known issues:**
* On GPUs, the MAX kernels are currently optimized primarily for bfloat16.
* Convolution on GPU is not performance optimized yet.
* Although v24.6 technically runs on H100, it doesn't include
performance-optimized kernels for that device yet and it isn't recommended.
### Mojo {#24-6-mojo}
Mojo is a crucial component of the MAX stack that enables all of MAX's
performance-oriented code across hardware. For all the updates to the Mojo
language, standard library, and tools, see the [Mojo
changelog](/mojo/changelog#v246-2024-12-17).
## v24.5 (2024-09-13)
### ✨ Highlights
* Mojo and MAX are magical! We've created a new package and virtual environment
manager, `magic`, for MAX and Mojo.
* New [Llama3.1
pipeline](https://github.com/modular/modular/tree/main/max/pipelines/architectures)
built with the new MAX Graph Python API.
* We have not one, but two new Python APIs that we're introducing in this
release:
* [MAX Graph Python API](#max-graph-python-api)
* [MAX Driver Python API](#max-driver-python-api)
### ⭐️ New
* Added `repeat_interleave` graph op.
* Added caching for MAX graph models.
Graph compilation is now cached, and the executable model is retrieved from
the cache on second and subsequent runs.
Note that the model cache is architecture-specific and isn't portable across
different targets.
* Support for Python 3.12.
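The architecture-specific caching behavior can be sketched with a toy memoized compile function. This is a hypothetical stand-in, not the real cache: the point is that the cache key includes the target, so an entry from one target is never reused for another.

```python
import functools

# Toy model-compilation cache keyed by (graph, target): repeat calls for
# the same target hit the cache; a different target recompiles.
@functools.lru_cache(maxsize=None)
def compile_model(graph_id: str, target: str) -> str:
    # Stand-in for real (expensive) graph compilation.
    return f"compiled({graph_id}@{target})"

a = compile_model("llama3", "x86_64")
b = compile_model("llama3", "x86_64")   # cache hit: same object returned
c = compile_model("llama3", "arm64")    # different target: recompiled
```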
#### MAX Graph Python API
This Python API
will ultimately provide the same low-level programming interface for
high-performance inference graphs as the Mojo API. As with the Mojo API, it's an
API for graph-building only, and it does not implement support for training.
You can take a look at how the API works in the
[MAX Graph Python API reference](/max/api/python/graph/).
#### MAX Driver Python API
The MAX Driver API allows you to interact with devices (such as CPUs and GPUs)
and allocate memory directly onto them. With this API, you can interact with
that memory as tensors.
Note that this API is still under development, with support for non-host
devices, such as GPUs, planned for a future release.
To learn more, check out the
[MAX Driver Python API reference](/max/api/python/driver).
#### MAX C API
New APIs for adding torch metadata libraries:
* `M_setTorchMetadataLibraryPath`
* `M_setTorchMetadataLibraryPtr`
### 🦋 Changed
#### MAX Engine performance
* Compared to v24.4, MAX Engine v24.5 generates tokens for Llama an average of
15%-48% faster.
#### MAX C API
Simplified the API for adding torch library paths, which now only takes one path
per API call, but can be called multiple times to add paths to the config:
* `M_setTorchLibraries` -> `M_setTorchLibraryPath`
### ⚠️ Deprecated
* The `max` command line tool is no longer supported and will be removed
in a future release.
### ❌ Removed
* Dropped support for Ubuntu 20.04. If you're using Ubuntu, we currently
support Ubuntu 22.04 LTS only.
* Dropped support for Python 3.8.
* Removed built-in PyTorch libraries from the max package. See the
[FAQ](/max/faq) for information on supported torch versions.
## v24.4 (2024-06-07)
### 🔥 Legendary
* MAX is now available on macOS! [Try it now](/max).
* New quantization APIs for MAX Graph. You can now build high-performance
graphs in Mojo that use the latest quantization techniques, enabling even
faster performance and more system compatibility for large models.
Learn more in the guide to [quantize your graph weights](/max/graph/quantize).
### ⭐️ New
#### MAX Mojo APIs
* Added AI pipeline examples in the `max` repo, with Mojo implementations for
common transformer layers, including quantization support.
* New Llama3 pipeline built with MAX Graph.
* New Replit Code pipeline built with MAX Graph.
* New TinyStories pipeline (based on TinyLlama) that offers a simple demo of
the MAX Graph quantization API.
* Added `max.graph.checkpoint` package
to save and load model weights.
All weights are stored in a
`TensorDict`.
You can save and load a `TensorDict` to disk with
`save()` and
`load()` functions.
* Added MAX Graph quantization APIs:
* Added quantization encodings
`BFloat16Encoding`,
`Q4_0Encoding`,
`Q4_KEncoding`,
and
`Q6_KEncoding`.
* Added the
`QuantizationEncoding`
trait so you can build custom quantization encodings.
* Added `Graph.quantize()`
to create a quantized tensor node.
* Added `qmatmul()` to
perform matrix-multiplication with a float32 and a quantized matrix.
* Added some MAX Graph ops:
* `avg_pool()`
* `max_pool()`
* `conv2d()`
* `conv3d()`
* `layer_norm()`
* `tile()`
* `select()`
* Added a `layer()` context
manager and
`current_layer()`
function to aid in debugging during graph construction. For example:
```mojo
with graph.layer("foo"):
    with graph.layer("bar"):
        print(graph.current_layer())  # prints "foo.bar"
        x = graph.constant[DType.int64](1)
        graph.output(x)
```
This adds a path `foo.bar` to the added nodes, which will
be reported during errors.
* Added
`format_system_stack()`
function to format the stack trace, which we use to print better error
messages from `error()`.
* Added
`TensorMap.keys()` to
get all the tensor key names.
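The 4-bit encodings listed above follow GGML-style block quantization. Here is a hedged NumPy sketch of a Q4_0-style round trip, assuming 32-element blocks with one float scale each; the real encoding packs two 4-bit codes per byte, which this sketch omits.

```python
import numpy as np

# Q4_0-style block quantization sketch: each 32-element block gets one
# scale, and values become 4-bit signed integer codes in [-8, 7].
def q4_0_roundtrip(x):
    blocks = x.reshape(-1, 32)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                        # avoid divide-by-zero
    q = np.clip(np.round(blocks / scale), -8, 7)   # 4-bit codes
    return (q * scale).reshape(x.shape)            # dequantized values

x = np.linspace(-1, 1, 64, dtype=np.float32)
err = np.abs(q4_0_roundtrip(x) - x).max()
# Quantization error is bounded by roughly half the block scale.
```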
#### MAX C API
Miscellaneous new APIs:
* `M_cloneCompileConfig()`
* `M_copyAsyncTensorMap()`
* `M_tensorMapKeys()` and `M_deleteTensorMapKeys()`
* `M_setTorchLibraries()`
### 🦋 Changed
#### MAX Mojo API
* `EngineNumpyView.data()`
and `EngineTensorView.data()`
functions that return a type-erased pointer were renamed to `unsafe_ptr()`.
* `TensorMap` now conforms
to `CollectionElement` trait to be copyable and movable.
* `custom_nv()` was removed, and its functionality moved into
`custom()` as a function
overload, so it can now output a list of tensor symbols.
## v24.3 (2024-05-02)
### 🔥 Legendary
* You can now write custom ops for your models with Mojo!
Learn more about [MAX extensibility](/max/develop/custom-ops).
### 🦋 Changed
* Added support for named dynamic dimensions. This means you can specify when two
or more dimensions in your model's input are dynamic but their sizes at run
time must match each other. By specifying each of these dimension sizes with a
name (instead of using `None` to indicate a dynamic size), the MAX Engine
compiler can perform additional optimizations. See the notes below for the
corresponding API changes that support named dimensions.
* Simplified all the APIs to load input specs for models, making them more
consistent.
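The named-dimension constraint described above can be sketched with a toy shape checker. The `check_shape` helper is hypothetical, not the MAX API: dims that share a name must match at run time, while a `None` dynamic dim is unconstrained.

```python
# Toy sketch of named dynamic dimensions: a string names a dynamic dim
# that must be consistent across inputs; None means "any size".
def check_shape(spec, shape, bindings=None):
    bindings = {} if bindings is None else bindings
    for dim, size in zip(spec, shape):
        if isinstance(dim, str):                  # named dynamic dim
            if bindings.setdefault(dim, size) != size:
                raise ValueError(f"dim {dim!r} mismatch: {size}")
        elif dim is not None and dim != size:     # static dim
            raise ValueError(f"static dim mismatch: {dim} != {size}")
    return bindings

b = check_shape(("batch", 128), (8, 128))   # binds batch=8
check_shape(("batch", None), (8, 77), b)    # same batch size: OK
```

Knowing that two dims are equal (rather than merely both dynamic) is what lets the compiler perform the extra optimizations.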
#### MAX Engine performance
* Compared to v24.2, MAX Engine v24.3 shows an average speedup of 10% on PyTorch
models, and an average 20% speedup on dynamically quantized ONNX transformers.
#### MAX Graph API
The `max.graph` APIs are still changing
rapidly, but starting to stabilize.
* `AnyMoType` renamed to `Type`,
`MOTensor` renamed to
`TensorType`, and `MOList`
renamed to `ListType`.
* Removed `ElementType` in favor of using `DType`.
* Removed `TypeTuple` in favor of using `List[Type]`.
* Removed the `Module` type so you can now start building a graph by directly
instantiating a `Graph`.
* Some new ops in `max.ops`, including
support for custom ops.
See how to [create a custom op in MAX
Graph](/max/develop/build-custom-ops).
#### MAX Engine Python API
* Redesigned
[`InferenceSession.load()`](/max/api/python/engine#max.engine.InferenceSession.load)
to replace the confusing `options` argument with a `custom_ops_path` argument.
As a result, `CommonLoadOptions`, `TorchLoadOptions`, and
`TensorFlowLoadOptions` have all been removed.
* [`TorchInputSpec`](/max/api/python/engine#max.engine.TorchInputSpec)
now supports named dynamic dimensions (previously, dynamic dimension sizes
could be specified only as `None`). This lets you tell MAX which dynamic
dimensions are required to have the same size, which helps MAX better optimize
your model.
#### MAX Engine Mojo API
* `InferenceSession.load_model()` was renamed to
`load()`.
* Redesigned
`InferenceSession.load()`
to replace the confusing `config` argument with a `custom_ops_path` argument
for use when [loading a custom op](/max/develop/build-custom-ops), and an
`input_specs` argument for use when loading TorchScript models.
Doing so removed `LoadOptions` and introduced the new
`InputSpec` type to define
the input shape/type of a model (instead of `LoadOptions`).
* New `ShapeElement`
type to allow for named dynamic dimensions (in `InputSpec`).
* `max.engine.engine` module was renamed to
`max.engine.info`.
#### MAX Engine C API
* [`M_newTorchInputSpec()`](/max/api/c/pytorch/config#m_newtorchinputspec)
now supports named dynamic dimensions (via new `dimNames` argument).
### ❌ Removed
* Removed TensorFlow support in the MAX SDK, so you can no longer load a
TensorFlow SavedModel for inference. However, TensorFlow is still available for
enterprise customers.
We removed TensorFlow because industry-wide TensorFlow usage has declined
significantly, especially for the latest AI innovations. Removing TensorFlow
also cuts our package size by over 50% and accelerates the development of
other customer-requested features. If you have a production use-case for a
TensorFlow model, please [contact
us](https://www.modular.com/request-demo).
* Removed the Python `CommonLoadOptions`, `TorchLoadOptions`, and
`TensorFlowLoadOptions` classes. See note above about
`InferenceSession.load()` changes.
* Removed the Mojo `LoadOptions` type. See the note above about
`InferenceSession.load()` changes.
## v24.2.1 (2024-04-11)
* You can now import more MAX Graph functions from `max.graph.ops` instead of
using `max.graph.ops.elementwise`. For example:
```mojo
from max.graph import ops
var relu = ops.relu(matmul)
```
## v24.2 (2024-03-28)
* MAX Engine now supports TorchScript models with dynamic input shapes.
No matter what the input shapes are, you still need to [specify the input
specs](/max/model-formats#specify-torchscript-input-specs) for all
TorchScript models.
* The Mojo standard library is now open source!
Read more about it in [this blog
post](https://www.modular.com/blog/the-next-big-step-in-mojo-open-source).
* And, of course, lots of Mojo updates, including implicit traits, support for
keyword arguments in Python calls, a new `List` type (previously
`DynamicVector`), some refactoring that might break your code, and much more.
For details, see the [Mojo changelog](/mojo/changelog#v242-2024-03-28).
## v24.1.1 (2024-03-18)
This is a minor release that improves error reports.
## v24.1 (2024-02-29)
The first release of the MAX platform is here! 🚀
This is a **preview version** of the MAX platform. That means it
is not ready for production deployment and is designed only for local
development and evaluation.
Because this is a preview, some API libraries are still in development and
subject to change, and some features that we previously announced are not quite
ready yet. But there is a lot that you can do in this release!
This release includes our flagship developer tools, currently for **Linux
only**:
* **MAX Engine**: Our state-of-the-art graph compiler and runtime library that
executes models from PyTorch and ONNX, with incredible inference
speed on a wide range of hardware.
* API libraries in Python, C, and Mojo to run inference with your existing
models. [See the API references](/max/api).
* The `max benchmark` tool, which runs MLPerf
benchmarks on any compatible model without writing any code.
* The `max visualize` tool, which allows you to visualize
your model in Netron after partially lowering in MAX Engine.
* An early look at the [MAX Graph API](/max/model-formats#max-graph), our
low-level library for building high-performance inference graphs.
* **MAX Serving**: A preview of our serving wrapper for MAX Engine that
provides full interoperability with existing AI serving systems (such as
Triton) and that seamlessly deploys within existing container infrastructure
(such as Kubernetes).
* A Docker image that runs MAX Engine as a backend for NVIDIA Triton
Inference Server.
* **Mojo**: The world's first programming language built from the ground up for AI
developers, with cutting-edge compiler technology that delivers unparalleled
performance and programmability for any hardware.
* The latest version of Mojo, the standard library, and the `mojo` command
line tool. These are always included in MAX, so you don't need to download
any separate packages.
* The Mojo changes in each release are often quite long, so we're going to
continue sharing those in the existing [Mojo changelog](/mojo/changelog).
Additionally, we've started a new [GitHub repo for
MAX](https://github.com/modular/max), where we currently share a bunch of
code examples for our API libraries, including some large model pipelines.
You can also use this repo to [report issues with
MAX](https://github.com/modular/modular/issues/new/choose).
### Model Architecture Support
* Added support for the following model architectures:
* `OlmoForCausalLM` (such as `allenai/OLMo-1B-0724-hf`)
* `GraniteForCausalLM` (such as `ibm-granite/granite-3.1-8b-instruct`)
* `Phi3ForCausalLM` (for Microsoft Phi-3 models)
* `Qwen2ForCausalLM` (such as Qwen2 models)
Example usage:
```sh
max-pipelines generate \
--model-path allenai/OLMo-1B-0724-hf \
--prompt "Write bubble sort in mojo"
```
* The `max.pipelines.dataprocessing.tokenizer` and
`max.pipelines.dataprocessing.gguf_utils` modules have been removed.
* The previously deprecated `PipelineConfig.architecture` field and its
corresponding `--architecture` CLI argument have been removed.
### `max-pipelines` CLI
* The `--devices` CLI argument now supports a comma-separated list of GPU IDs
prefixed with `gpu:` like `--devices=gpu:0,1,2,3`. We no longer support the
previous `--devices=gpu-` format.
```sh
max-pipelines generate --model-path=meta-llama/Llama-3.3-70B-Instruct \
--quantization-encoding bfloat16 \
--devices gpu:0,1,2,3 \
--prompt="Design a self-sustaining colony on Neptune's moon Triton with a myth/science fusion name, three quantum tech breakthroughs, one ethical debate, a neon-lit cultural ritual, and a hidden flaw—presented in bullet points."
```
* Removed `--huggingface-repo-id` PipelineConfig option and CLI argument in favor
of `--model-path`.
* Consolidated `--model-path` and `--weight-path`. If valid `--weight-path`(s) are
provided, they now override `--model-path`, which handles both local
and remote (Hugging Face) cases. If the weights cannot be derived from the
`--weight-path`(s), MAX falls back to the `--model-path`, which must be set
explicitly by the user.
* Added `--huggingface-revision` option, to allow selecting a non-default branch
or a specific commit in a Hugging Face model repository.
---
## max benchmark
Runs comprehensive benchmark tests on an active model server to measure
performance metrics including throughput, latency, and resource utilization.
For a complete walkthrough, see the tutorial to [benchmark MAX on a
GPU](/max/deploy/benchmark).
Before running this command, make sure the model server is running, via [`max
serve`](/max/cli/serve).
For example, here's how to benchmark the `google/gemma-3-27b-it` model
already running on localhost:
```sh
max benchmark \
--model google/gemma-3-27b-it \
--backend modular \
--endpoint /v1/chat/completions \
--num-prompts 50 \
--dataset-name arxiv-summarization \
--arxiv-summarization-input-len 12000 \
--max-output-len 1200
```
When it's done, you'll see the results printed to the terminal.
By default, it sends inference requests to `localhost:8000`, but you can change
that with the `--host` and `--port` arguments.
If you want to save the results, add the `--save-result` option, which creates
a JSON file in the local path with the following naming convention:
```bash
{backend}-{request_rate}qps-{model_name}-{timestamp}.json
```
But you can specify the file name with `--result-filename` and change the
directory with `--result-dir`.
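As a rough illustration, the default naming convention can be reproduced in
Python (the timestamp format and model-name sanitization here are assumptions,
not the benchmark script's exact behavior):

```python
from datetime import datetime, timezone

def result_filename(backend: str, request_rate: str, model_name: str) -> str:
    """Build a name following {backend}-{request_rate}qps-{model_name}-{timestamp}.json."""
    # Slashes in Hugging Face model IDs can't appear in filenames.
    safe_model = model_name.replace("/", "-")
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{backend}-{request_rate}qps-{safe_model}-{timestamp}.json"

print(result_filename("modular", "inf", "google/gemma-3-27b-it"))
```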
Instead of passing all of these benchmark options on the command line, you can
pass a configuration file. See [Configuration file](#benchmark-configuration-file)
below.
:::note
The `max benchmark` command is a convenient wrapper around our open-source
[`benchmark_serving.py`](https://github.com/modular/modular/tree/main/max/python/max/benchmark#benchmark-max)
script and accepts all the same options.
:::
## Usage
```sh
max benchmark [OPTIONS]
```
## Options
This list of options is not exhaustive. For more information, run `max
benchmark --help` or see the [benchmarking script source
code](https://github.com/modular/modular/tree/main/max/python/max/benchmark).
* Backend configuration:
* `--backend`: Choose from `modular` (MAX `v1/completions` endpoint),
`modular-chat` (MAX `v1/chat/completions` endpoint), or `vllm` (vLLM)
* `--model`: Hugging Face model ID or local path
* Load generation:
* `--num-prompts`: Number of prompts to process (`int`, default: `500`)
* `--request-rate`: Request rate in requests/second (`int`, default: `inf`)
* `--seed`: The random seed used to sample the dataset (`int`, default: `0`)
* Serving options
* `--base-url`: Base URL of the API service
* `--endpoint`: Specific API endpoint (`/v1/completions` or
`/v1/chat/completions`)
* `--tokenizer`: Hugging Face tokenizer to use (can be different from model)
* `--dataset-name`: (Required; default: `sharegpt`) Specifies which type of
benchmark dataset to use. This determines the dataset class and processing
logic. See [Datasets](#datasets) below.
* `--dataset-path`: Path to a local dataset file that overrides the default
dataset source for the specified `dataset-name`. The file format must match
the expected format for the specified `dataset-name` (such as JSON for
`axolotl`, JSONL for `obfuscated-conversations`, plain text for `sonnet`).
* Additional options
* `--collect-gpu-stats`: Report GPU utilization and memory consumption
for both NVIDIA and AMD GPUs. Only works when running `max benchmark`
on the same instance as the server.
* `--save-result`: Saves results to a local JSON file.
* LoRA benchmarking options
The benchmark script supports testing LoRA adapter performance for
supported models and target modules:
* `--num-loras`: Number of LoRA adapters to test. If > 0, test LoRA
adapters will be generated.
* `--lora-rank`: LoRA rank (r parameter) for generated adapters. Controls
the dimension of the low-rank decomposition.
* `--lora-output-dir`: Directory to save generated LoRA adapters.
Defaults to `/tmp/loras`.
* `--lora-paths`: Paths to existing LoRA adapters to use instead of
generating new ones.
* `--lora-request-ratio`: Ratio of requests to send with LoRA adapters
(0.0-1.0). For example, 0.5 means 50% of requests use LoRA.
* `--max-num-loras`: Maximum number of LoRA adapters cached on GPU.
This should match the server configuration.
* `--lora-target-modules`: List of module names to apply LoRA to when
generating random test adapters (e.g., `q_proj`, `k_proj`, `v_proj`,
`o_proj`). Only used when `--num-loras` > 0 and generating adapters
(not when using existing `--lora-paths`).
* `--config-file`: Path to a YAML file containing benchmark configuration.
The configuration file is a YAML file that contains key-value pairs for all
your benchmark configurations (as a replacement for individual command line
options). See [Configuration file](#benchmark-configuration-file) below.
### Datasets
The `--dataset-name` option supports several dataset names/formats you can
use for benchmarking:
* `arxiv-summarization` - Research paper summarization dataset containing
academic papers with abstracts for training summarization models, from Hugging
Face Datasets.
* `axolotl` - Local dataset in Axolotl format with conversation segments
labeled as human/assistant text, from Hugging Face Datasets.
* `code_debug` - Long-context code debugging dataset containing code with
multiple choice debugging questions for testing long-context understanding,
from Hugging Face Datasets.
* `obfuscated-conversations` - Local dataset with obfuscated conversation data.
You must pair this with the `--dataset-path` option to specify the local JSONL
file.
* `random` - Synthetically generated random dataset that creates random
token sequences with configurable input/output lengths and distributions.
* `sharegpt` - Conversational dataset containing human-AI conversations for
chat model evaluation, from Hugging Face Datasets.
* `sonnet` - Poetry dataset using local text files containing poem lines,
from Hugging Face Datasets.
* `vision-arena` - Vision-language benchmark dataset containing images with
associated questions for multimodal model evaluation, from Hugging Face
Datasets.
You can override the default dataset source for any of these using the
`--dataset-path` option (except for generated datasets like `random`), but you
must always specify a `--dataset-name` so the tool knows how to process the
dataset format.
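To make the `random` dataset concrete, here's a minimal sketch of generating
random token sequences with configurable lengths (the sampling details are an
illustration, not the script's actual implementation, and length distributions
are omitted for brevity):

```python
import random

def make_random_requests(num_prompts: int, input_len: int, output_len: int,
                         vocab_size: int = 32000, seed: int = 0):
    """Generate (prompt_token_ids, max_output_len) pairs of fixed lengths."""
    rng = random.Random(seed)  # seeded for reproducible benchmark runs
    return [
        ([rng.randrange(vocab_size) for _ in range(input_len)], output_len)
        for _ in range(num_prompts)
    ]

requests = make_random_requests(num_prompts=4, input_len=8, output_len=16)
```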
### Configuration file {#benchmark-configuration-file}
The `--config-file` option allows you to specify a YAML file containing all
your benchmark configurations, as a replacement for individual command line
options. Simply define all the configuration options (corresponding to the `max
benchmark` command line options) in a YAML file, all nested under the
`benchmark_config` key.
:::caution
In the YAML file, the properties **must use `snake_case` names** instead of
using the hyphenated names from the command line options. For example,
`--num-prompts` becomes `num_prompts`.
:::
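The mapping between flag names and YAML keys is mechanical, as this small
sketch shows (a hypothetical helper, not part of the CLI):

```python
def cli_to_yaml_key(option: str) -> str:
    """Convert a CLI flag like --num-prompts to its YAML key num_prompts."""
    return option.lstrip("-").replace("-", "_")

print(cli_to_yaml_key("--num-prompts"))   # num_prompts
print(cli_to_yaml_key("--dataset-name"))  # dataset_name
```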
For instance, instead of specifying all configurations in the command line like
this:
```sh
max benchmark \
--model google/gemma-3-27b-it \
--backend modular \
--endpoint /v1/chat/completions \
--host localhost \
--port 8000 \
--num-prompts 50 \
--dataset-name arxiv-summarization \
--arxiv-summarization-input-len 12000 \
--max-output-len 1200
```
Create this configuration file:
```yaml title="gemma-benchmark.yaml"
benchmark_config:
model: google/gemma-3-27b-it
backend: modular
endpoint: /v1/chat/completions
host: localhost
port: 8000
num_prompts: 50
dataset_name: arxiv-summarization
arxiv_summarization_input_len: 12000
max_output_len: 1200
```
And then run the benchmark by passing that file:
```sh
max benchmark --config-file gemma-benchmark.yaml
```
For more config file examples, see our [benchmark configs on
GitHub](https://github.com/modular/modular/tree/main/max/python/max/benchmark/configs).
For a walkthrough of setting up an endpoint and running a benchmark, see the
[quickstart guide](/max/get-started).
## Output
Here's an explanation of the most important metrics printed upon completion:
* **Request throughput**: Number of complete requests processed per second
* **Input token throughput**: Number of input tokens processed per second
* **Output token throughput**: Number of tokens generated per second
* **TTFT**: Time to first token—the time from request start to first
token generation
* **TPOT**: Time per output token—the average time taken to generate
each output token
* **ITL**: Inter-token latency—the average time between consecutive token
or token-chunk generations
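As a rough sketch of how these latency metrics relate to each other, here they
are computed from hypothetical per-token timestamps (illustrative only, not
the benchmark script's actual code):

```python
def latency_metrics(request_start: float, token_times: list[float]):
    """Compute TTFT, TPOT, and mean ITL from one request's token timestamps."""
    assert len(token_times) >= 2, "need at least two output tokens"
    ttft = token_times[0] - request_start          # time to first token
    total = token_times[-1] - request_start
    tpot = (total - ttft) / (len(token_times) - 1)  # avg time per token after the first
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps)                # avg gap between consecutive tokens
    return ttft, tpot, mean_itl

# First token at 0.5s, then one token every 0.1s.
ttft, tpot, itl = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
```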
If `--collect-gpu-stats` is set, you'll also see these:
* **GPU utilization**: Percentage of time during which at least one GPU kernel
is being executed
* **Peak GPU memory used**: Peak memory usage during benchmark run
---
## max encode
Converts input text into embeddings for semantic search, text similarity, and
NLP applications.
For example:
```bash
max encode \
--model sentence-transformers/all-MiniLM-L6-v2 \
--prompt "Convert this text into embeddings"
```
## Usage
```shell
max encode [OPTIONS]
```
## Options
### `--allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-allow-safetensors-weights-fp32-bf6-bidirectional-cast`
Whether to allow automatic float32 to/from bfloat16 safetensors weight type casting, if needed. Currently only supported in Llama3 models.
### `--cache-strategy `
The cache strategy to use. This defaults to model\_default, which selects the default strategy for the requested architecture. You can also force a specific strategy: continuous or paged.
### `--ce-delay-ms `
Duration of scheduler sleep prior to starting a prefill batch. Experimental for the TTS scheduler.
### `--chat-template `
Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model’s default chat template is used.
### `--config-file `
### `--custom-architectures `
Custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name. Each module must expose an `ARCHITECTURES` list of architectures to register.
### `--data-parallel-degree `
Data-parallelism parameter. The degree to which the model is replicated is dependent on the model type.
### `--defer-resolve, --no-defer-resolve`
Whether to defer resolving the pipeline config.
### `--device-graph-capture, --no-device-graph-capture`
Enable device graph capture/replay for graph execution.
### `--device-memory-utilization `
The fraction of available device memory that the process should consume. This informs the KVCache workspace size: kv\_cache\_workspace = (total\_free\_memory \* device\_memory\_utilization) - model\_weights\_size.
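For example, the workspace formula works out like this (all sizes in GiB; the
numbers are hypothetical):

```python
def kv_cache_workspace_gib(total_free_memory: float,
                           device_memory_utilization: float,
                           model_weights_size: float) -> float:
    """kv_cache_workspace = (total_free_memory * device_memory_utilization) - model_weights_size"""
    return total_free_memory * device_memory_utilization - model_weights_size

# 80 GiB free, a 0.9 utilization target, and 16 GiB of weights
# leave 56 GiB for the KVCache workspace.
print(kv_cache_workspace_gib(80.0, 0.9, 16.0))
```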
### `--devices `
Whether to run the model on CPU (`--devices=cpu`), GPU (`--devices=gpu`), or a list of GPUs (`--devices=gpu:0,1`), etc. An ID value can optionally be provided to indicate the device ID to target. If not provided, the model runs on the first available GPU (`--devices=gpu`), or on CPU if no GPUs are available (`--devices=cpu`).
### `--draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast`
Whether to allow automatic float32 to/from bfloat16 safetensors weight type casting, if needed. Currently only supported in Llama3 models.
### `--draft-config-file `
### `--draft-data-parallel-degree `
Data-parallelism parameter. The degree to which the model is replicated is dependent on the model type.
### `--draft-devices `
Whether to run the model on CPU (`--devices=cpu`), GPU (`--devices=gpu`), or a list of GPUs (`--devices=gpu:0,1`), etc. An ID value can optionally be provided to indicate the device ID to target. If not provided, the model runs on the first available GPU (`--devices=gpu`), or on CPU if no GPUs are available (`--devices=cpu`).
### `--draft-force-download, --no-draft-force-download`
Whether to force download a given file if it’s already present in the local cache.
### `--draft-huggingface-model-revision `
Branch or Git revision of Hugging Face model repository to use.
### `--draft-huggingface-weight-revision `
Branch or Git revision of Hugging Face model repository to use.
### `--draft-model-path `
The repository ID of a Hugging Face model to use. The `--model` option also works as an alias.
### `--draft-quantization-encoding `
Weight encoding type.
### `--draft-section-name `
### `--draft-served-model-name `
Optional override for client-facing model name. Defaults to model\_path.
### `--draft-trust-remote-code, --no-draft-trust-remote-code`
Whether or not to allow for custom modelling files on Hugging Face.
### `--draft-use-subgraphs, --no-draft-use-subgraphs`
Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.
### `--draft-vision-config-overrides `
Model-specific vision configuration overrides. For example, for InternVL: `{"max_dynamic_patch": 24}`.
### `--draft-weight-path `
Optional path or url of the model weights to use.
### `--enable-chunked-prefill, --no-enable-chunked-prefill`
Enable chunked prefill to split context encoding requests into multiple chunks based on max\_batch\_input\_tokens.
### `--enable-echo, --no-enable-echo`
Whether the model should be built with echo capabilities.
### `--enable-in-flight-batching, --no-enable-in-flight-batching`
When enabled, prioritizes token generation by batching it with context encoding requests.
### `--enable-kvcache-swapping-to-host, --no-enable-kvcache-swapping-to-host`
Whether to swap paged KVCache blocks to host memory when device blocks are evicted.
### `--enable-lora, --no-enable-lora`
Enables LoRA on the server.
### `--enable-min-tokens, --no-enable-min-tokens`
Whether to enable min\_tokens, which blocks the model from generating stopping tokens before the min\_tokens count is reached.
### `--enable-overlap-scheduler, --no-enable-overlap-scheduler`
Whether to enable the overlap scheduler. This feature allows the scheduler to run alongside GPU execution, which helps improve GPU utilization. It is an experimental feature that may crash and burn. It will be enabled by default for some selected architectures. You can forcibly disable it by setting `--no-enable-overlap-scheduler --force`.
### `--enable-penalties, --no-enable-penalties`
Whether to apply frequency and presence penalties to the model’s output.
### `--enable-prefix-caching, --no-enable-prefix-caching`
Whether to enable prefix caching for the paged KVCache.
### `--enable-prioritize-first-decode, --no-enable-prioritize-first-decode`
When enabled, the scheduler always runs a TG batch immediately after a CE batch with the same requests. This may reduce time-to-first-chunk latency. Experimental for the TTS scheduler.
### `--enable-structured-output, --no-enable-structured-output`
Enable structured generation/guided decoding for the server. This allows the user to pass a json schema in the response\_format field, which the LLM will adhere to.
### `--enable-variable-logits, --no-enable-variable-logits`
Enable the sampling graph to accept a ragged tensor of different sequences as inputs, along with their associated logit\_offsets. This is needed to produce additional logits for echo and speculative decoding purposes.
### `--ep-size `
The expert parallelism size. Needs to be 1 (no expert parallelism) or the total number of GPUs across nodes.
### `--execute-empty-batches, --no-execute-empty-batches`
Whether the scheduler should execute empty batches.
### `--force, --no-force`
Skip validation of user-provided flags against the architecture's required arguments.
### `--force-download, --no-force-download`
Whether to force download a given file if it’s already present in the local cache.
### `--gpu-profiling `
Whether to enable GPU profiling of the model.
### `--host-kvcache-swap-space-gb `
The amount of host memory to use for the host KVCache in GiB. This space is only allocated when kvcache\_swapping\_to\_host is enabled.
### `--huggingface-model-revision `
Branch or Git revision of Hugging Face model repository to use.
### `--huggingface-weight-revision `
Branch or Git revision of Hugging Face model repository to use.
### `--kv-cache-format `
Override the default data type for the KV cache. Supported values: `float32`, `bfloat16`, `float8_e4m3fn`.
### `--kv-cache-page-size `
The number of tokens in a single page in the paged KVCache.
### `--kvcache-ce-watermark `
Projected cache usage threshold for scheduling CE requests, considering current and incoming requests. CE is scheduled if either projected usage stays below this threshold or no active requests exist. Higher values can cause more preemptions.
### `--lora-paths `
List of statically defined LoRA paths.
### `--max-batch-input-tokens `
The target number of un-encoded tokens to include in each batch. This value is used for chunked prefill and memory estimation.
### `--max-batch-size `
Maximum batch size to execute with the model. When not specified (None), this value is determined dynamically. For server launches, set this higher based on server capacity.
### `--max-batch-total-tokens `
Ensures that the sum of the context length in a batch does not exceed max\_batch\_total\_tokens. If None, the sum is not limited.
### `--max-length `
Maximum sequence length of the model.
### `--max-lora-rank `
Maximum rank of all possible LoRAs.
### `--max-num-loras `
The maximum number of active LoRAs in a batch. This controls how many LoRA adapters can be active simultaneously during inference. Lower values reduce memory usage but limit concurrent adapter usage.
### `--max-num-steps `
The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).
### `--max-queue-size-tg `
Maximum number of requests in decode queue. By default, this is max\_batch\_size.
### `--min-batch-size-tg `
Soft floor on the decode batch size. If the TG batch size is larger, the scheduler continues TG batches; if it falls below, the scheduler prioritizes CE. This is not a strict minimum. By default, this is max\_queue\_size\_tg. Experimental for the TTS scheduler.
### `--model-path `
The repository ID of a Hugging Face model to use. The `--model` option also works as an alias.
### `--num-speculative-tokens `
The number of speculative tokens.
### `--num-warmups `
Number of warmup iterations to run before the final timed run.
**Default:**
`0`
### `--pipeline-role `
Whether the pipeline should serve a prefill role, a decode role, or both.
### `--pool-embeddings, --no-pool-embeddings`
Whether to pool embedding outputs.
### `--prompt `
The text prompt to use for further generation.
### `--quantization-encoding `
Weight encoding type.
### `--trust-remote-code, --no-trust-remote-code`
Whether or not to allow for custom modelling files on Hugging Face.
### `--use-experimental-kernels `
Enables using experimental mojo kernels with max serve. The kernels could be unstable or incorrect.
### `--use-legacy-module, --no-use-legacy-module`
Whether to use the legacy Module architecture (default=True for backward compatibility). Set to False to use the new Module-based architecture when available.
### `--use-subgraphs, --no-use-subgraphs`
Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.
### `--use-vendor-blas `
Enables using vendor BLAS libraries (cublas/hipblas/etc) with max serve. Currently, this just replaces matmul calls.
### `--vision-config-overrides `
Model-specific vision configuration overrides. For example, for InternVL: `{"max_dynamic_patch": 24}`.
### `--weight-path `
Optional path or url of the model weights to use.
### `--zmq-endpoint-base `
Prefix for ZMQ endpoints used for IPC. This ensures unique endpoints across MAX Serve instances on the same host. Example: `lora_request_zmq_endpoint = f"{zmq_endpoint_base}-lora_request"`.
---
## max generate
Generates output from a given model and prompt, without using an
endpoint—primarily for debugging and testing purposes.
For example:
```bash
max generate \
--model google/gemma-3-12b-it \
--max-length 1024 \
--max-new-tokens 500 \
--top-k 40 \
--temperature 0.7 \
--seed 42 \
--prompt "Explain quantum computing"
```
:::note
You can adjust parameters like `--max-batch-size` and `--max-length` depending on
your system's available resources such as GPU memory.
:::
For more information on how to use the `generate` command with vision models,
see [Image to text](/max/inference/image-to-text).
## Usage
```shell
max generate [OPTIONS]
```
## Options
### `--allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-allow-safetensors-weights-fp32-bf6-bidirectional-cast`
Whether to allow automatic float32 to/from bfloat16 safetensors weight type casting, if needed. Currently only supported in Llama3 models.
### `--cache-strategy `
The cache strategy to use. This defaults to model\_default, which selects the default strategy for the requested architecture. You can also force a specific strategy: continuous or paged.
### `--ce-delay-ms `
Duration of scheduler sleep prior to starting a prefill batch. Experimental for the TTS scheduler.
### `--chat-template `
Optional custom chat template to override the one shipped with the Hugging Face model config. If a path is provided, the file is read during config resolution and the content stored as a string. If None, the model’s default chat template is used.
### `--config-file `
### `--custom-architectures `
Custom architecture implementations to register. Each input can either be a raw module name or an import path followed by a colon and the module name. Each module must expose an `ARCHITECTURES` list of architectures to register.
### `--data-parallel-degree `
Data-parallelism parameter. The degree to which the model is replicated is dependent on the model type.
### `--defer-resolve, --no-defer-resolve`
Whether to defer resolving the pipeline config.
### `--detokenize, --no-detokenize`
Whether to detokenize the output tokens into text.
### `--device-graph-capture, --no-device-graph-capture`
Enable device graph capture/replay for graph execution.
### `--device-memory-utilization `
The fraction of available device memory that the process should consume. This informs the KVCache workspace size: kv\_cache\_workspace = (total\_free\_memory \* device\_memory\_utilization) - model\_weights\_size.
### `--devices `
Whether to run the model on CPU (`--devices=cpu`), GPU (`--devices=gpu`), or a list of GPUs (`--devices=gpu:0,1`), etc. An ID value can optionally be provided to indicate the device ID to target. If not provided, the model runs on the first available GPU (`--devices=gpu`), or on CPU if no GPUs are available (`--devices=cpu`).
### `--draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast, --no-draft-allow-safetensors-weights-fp32-bf6-bidirectional-cast`
Whether to allow automatic float32 to/from bfloat16 safetensors weight type casting, if needed. Currently only supported in Llama3 models.
### `--draft-config-file `
### `--draft-data-parallel-degree `
Data-parallelism parameter. The degree to which the model is replicated is dependent on the model type.
### `--draft-devices `
Whether to run the model on CPU (`--devices=cpu`), GPU (`--devices=gpu`), or a list of GPUs (`--devices=gpu:0,1`), etc. An ID value can optionally be provided to indicate the device ID to target. If not provided, the model runs on the first available GPU (`--devices=gpu`), or on CPU if no GPUs are available (`--devices=cpu`).
### `--draft-force-download, --no-draft-force-download`
Whether to force download a given file if it’s already present in the local cache.
### `--draft-huggingface-model-revision `
Branch or Git revision of Hugging Face model repository to use.
### `--draft-huggingface-weight-revision `
Branch or Git revision of Hugging Face model repository to use.
### `--draft-model-path `
The repository ID of a Hugging Face model to use. The `--model` option also works as an alias.
### `--draft-quantization-encoding `
Weight encoding type.
### `--draft-section-name `
### `--draft-served-model-name `
Optional override for client-facing model name. Defaults to model\_path.
### `--draft-trust-remote-code, --no-draft-trust-remote-code`
Whether or not to allow for custom modelling files on Hugging Face.
### `--draft-use-subgraphs, --no-draft-use-subgraphs`
Whether to use subgraphs for the model. This can significantly reduce compile time, especially for large models with identical blocks. Default is true.
### `--draft-vision-config-overrides `
Model-specific vision configuration overrides. For example, for InternVL: `{"max_dynamic_patch": 24}`.
### `--draft-weight-path `
Optional path or url of the model weights to use.
### `--enable-chunked-prefill, --no-enable-chunked-prefill`
Enable chunked prefill to split context encoding requests into multiple chunks based on max\_batch\_input\_tokens.
### `--enable-echo, --no-enable-echo`
Whether the model should be built with echo capabilities.
### `--enable-in-flight-batching, --no-enable-in-flight-batching`
When enabled, prioritizes token generation by batching it with context encoding requests.
### `--enable-kvcache-swapping-to-host, --no-enable-kvcache-swapping-to-host`
Whether to swap paged KVCache blocks to host memory when device blocks are evicted.
### `--enable-lora, --no-enable-lora`
Enables LoRA on the server.
### `--enable-min-tokens, --no-enable-min-tokens`
Whether to enable min\_tokens, which blocks the model from generating stopping tokens before the min\_tokens count is reached.
### `--enable-overlap-scheduler, --no-enable-overlap-scheduler`
Whether to enable the overlap scheduler. This feature allows the scheduler to run alongside GPU execution, which helps improve GPU utilization. It is an experimental feature that may crash and burn. It will be enabled by default for some selected architectures. You can forcibly disable it by setting `--no-enable-overlap-scheduler --force`.
### `--enable-penalties, --no-enable-penalties`
Whether to apply frequency and presence penalties to the model’s output.
### `--enable-prefix-caching, --no-enable-prefix-caching`
Whether to enable prefix caching for the paged KVCache.
### `--enable-prioritize-first-decode, --no-enable-prioritize-first-decode`
When enabled, the scheduler always runs a TG batch immediately after a CE batch with the same requests. This may reduce time-to-first-chunk latency. Experimental for the TTS scheduler.
### `--enable-structured-output, --no-enable-structured-output`
Enable structured generation/guided decoding for the server. This allows the user to pass a json schema in the response\_format field, which the LLM will adhere to.
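As an illustration of how a client might use structured output once the server is running with this flag, the sketch below sends a JSON schema in the `response_format` field. It assumes an OpenAI-compatible `/v1/chat/completions` endpoint on port 8000; the model name and schema are hypothetical placeholders.

```shell
# Illustrative request; endpoint, port, and model name are assumptions.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "List two primary colors."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "colors",
        "schema": {
          "type": "object",
          "properties": {"colors": {"type": "array", "items": {"type": "string"}}},
          "required": ["colors"]
        }
      }
    }
  }'
```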
### `--enable-variable-logits, --no-enable-variable-logits`
Enable the sampling graph to accept a ragged tensor of different sequences as inputs, along with their associated logit\_offsets. This is needed to produce additional logits for echo and speculative decoding purposes.
### `--ep-size `
The expert parallelism size. Needs to be 1 (no expert parallelism) or the total number of GPUs across nodes.
### `--execute-empty-batches, --no-execute-empty-batches`
Whether the scheduler should execute empty batches.
### `--force, --no-force`
Skip validation of user provided flags against the architecture’s required arguments.
### `--force-download, --no-force-download`
Whether to force download a given file if it’s already present in the local cache.
### `--frequency-penalty `
The frequency penalty to apply to the model’s output. A positive value will penalize new tokens based on their frequency in the generated text.
### `--gpu-profiling `
Whether to enable GPU profiling of the model.
### `--host-kvcache-swap-space-gb `
The amount of host memory to use for the host KVCache in GiB. This space is only allocated when kvcache\_swapping\_to\_host is enabled.
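A minimal sketch of using these two flags together, assuming the `max serve` entry point and an example model repository. Evicted device KVCache pages spill into the host swap space instead of being discarded.

```shell
# Illustrative launch; model repo and swap size are assumptions.
max serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-kvcache-swapping-to-host \
  --host-kvcache-swap-space-gb 32
```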
### `--huggingface-model-revision `
Branch or Git revision of Hugging Face model repository to use.
### `--huggingface-weight-revision `
Branch or Git revision of the Hugging Face weights repository to use.
### `--ignore-eos`
If True, the response will ignore the EOS token, and continue to generate until the max tokens or a stop string is hit.
### `--image_url `
Images to include along with prompt, specified as URLs. The images are ignored if the model does not support image inputs.
### `--kv-cache-format `
Override the default data type for the KV cache. Supported values: float32, bfloat16, float8\_e4m3fn.
### `--kv-cache-page-size `
The number of tokens in a single page in the paged KVCache.
### `--kvcache-ce-watermark `
Projected cache usage threshold for scheduling CE requests, considering current and incoming requests. CE is scheduled if either projected usage stays below this threshold or no active requests exist. Higher values can cause more preemptions.
### `--lora-paths `
List of statically defined LoRA paths.
### `--max-batch-input-tokens `
The target number of un-encoded tokens to include in each batch. This value is used for chunked prefill and memory estimation.
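For example, chunked prefill can be paired with a token budget so that long prompts are encoded in chunks rather than in one pass. This sketch assumes the `max serve` entry point; the model repository and budget are illustrative.

```shell
# Cap each batch at roughly 8192 un-encoded tokens (values are assumptions).
max serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill \
  --max-batch-input-tokens 8192
```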
### `--max-batch-size `
Maximum batch size to execute with the model. When not specified (None), this value is determined dynamically. For server launches, set this higher based on server capacity.
### `--max-batch-total-tokens `
Ensures that the sum of the context length in a batch does not exceed max\_batch\_total\_tokens. If None, the sum is not limited.
### `--max-length `
Maximum sequence length of the model.
### `--max-lora-rank `
Maximum rank of all possible LoRAs.
### `--max-new-tokens `
Maximum number of new tokens to generate during a single inference pass of the model.
### `--max-num-loras `
The maximum number of active LoRAs in a batch. This controls how many LoRA adapters can be active simultaneously during inference. Lower values reduce memory usage but limit concurrent adapter usage.
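The LoRA flags above are typically used together: enable LoRA, register adapter paths, and bound the rank and concurrency. The sketch below is illustrative only; the adapter paths, rank, and count are assumptions, and the exact list syntax for `--lora-paths` may differ.

```shell
# Serve a base model with two statically registered LoRA adapters
# (paths and limits are hypothetical).
max serve \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-paths ./adapters/support-bot ./adapters/summarizer \
  --max-lora-rank 16 \
  --max-num-loras 2
```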
### `--max-num-steps `
The number of steps to run for multi-step scheduling. -1 specifies a default value based on configuration and platform. Ignored for models which are not auto-regressive (e.g. embedding models).
### `--max-queue-size-tg `
Maximum number of requests in decode queue. By default, this is max\_batch\_size.
### `--min-batch-size-tg `
Soft floor on the decode batch size. If the TG batch size is larger, the scheduler continues TG batches; if it falls below, the scheduler prioritizes CE. This is not a strict minimum. By default, this is max\_queue\_size\_tg. Experimental for the TTS scheduler.
### `--min-new-tokens `
Minimum number of tokens to generate in the response.
### `--min-p `
Float that represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in \[0, 1]. Set to 0 to disable this.
### `--model-path `
The repository ID of a Hugging Face model to use. The `--model` option also works as an alias.
### `--num-speculative-tokens `
The number of speculative tokens.
### `--num-warmups `
Number of warmup iterations to run before the final timed run.
**Default:**
`0`
### `--pipeline-role `
Whether the pipeline should serve a prefill role, a decode role, or both.
### `--pool-embeddings, --no-pool-embeddings`
Whether to pool embedding outputs.
### `--presence-penalty `
The presence penalty to apply to the model’s output. A positive value will penalize new tokens that have already appeared in the generated text at least once.
### `--prompt `
The text prompt to use for further generation.
### `--quantization-encoding `
Weight encoding type.
### `--repetition-penalty `
The repetition penalty to apply to the model’s output. Values > 1 will penalize new tokens that have already appeared in the generated text at least once.
### `--rope-type `
Force using a specific rope type: none, normal, or neox. Only matters for GGUF weights.
### `--stop `
A list of detokenized sequences that can be used as stop criteria when generating a new sequence. Can be specified multiple times.
### `--stop-token-ids `
A list of token ids that are used as stopping criteria when generating a new sequence. Comma-separated integers.
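Several of the generation flags above combine naturally in a one-off run. This sketch assumes a `max generate` entry point; the model repository, prompt, and stop token id are illustrative.

```shell
# Illustrative one-off generation; values are assumptions.
# Generation stops at token id 128009 or after 256 new tokens.
max generate \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --prompt "Summarize the benefits of paged KV caching." \
  --max-new-tokens 256 \
  --stop-token-ids 128009 \
  --temperature 0.7
```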
### `--temperature `
Controls the randomness of the model’s output; higher values produce more diverse responses.
### `--top-k `
Limits the sampling to the K most probable tokens. This defaults to 255. For greedy sampling, set to 1.
### `--top-p