BendKernels

Pure Bend examples for algorithms that want thousands of lightweight threads.

BendKernels is a small benchmark and learning repo for Bend, a high-level language that exposes parallelism from ordinary recursive structure. The goal is not to replace Rust, C, or CUDA for every workload. The goal is to show the kinds of programs where Bend's compiler/runtime can find a lot of independent work without explicit threads, locks, atomics, or GPU kernels.

Why this exists

Most parallel examples hide the interesting part behind a framework. Bend makes the parallel shape visible in the program itself:

balanced recursion
tree folds
divide-and-conquer algorithms
sorting networks
independent map/reduce kernels

The exact same .bend file can be run on the reference runtime, parallel CPU runtime, or CUDA runtime when supported by the machine.

Examples

Example	Pattern	What it demonstrates
`examples/parallel_sum.bend`	tree generation + fold	balanced reduction over independent branches
`examples/tree_map_reduce.bend`	map/reduce	independent leaf kernels with a final checksum
`examples/bitonic_sort.bend`	sorting network	compare/swap structure that maps well to parallel execution
`examples/fibonacci_modes.bend`	branching vs sequential recursion	parallelism is not the same as algorithmic efficiency
`examples/ai/tiny_mlp.bend`	neural-network inference	independent hidden neurons and batch samples
`examples/ai/minimax_tree.bend`	game-tree search	independent child-position evaluation
`examples/ai/ensemble_vote.bend`	ensemble inference	independent weak-learner votes over a batch

See docs/ai-kernels.md for the AI-focused examples.

Install Bend

cargo install hvm
cargo install bend-lang
bend --version

For GPU execution, install the NVIDIA CUDA toolkit supported by Bend/HVM, then use bend run-cu.

Run

Reference runtime:

bend run-rs examples/parallel_sum.bend -s

Parallel CPU runtime:

bend run-c examples/parallel_sum.bend -s

CUDA runtime, on supported NVIDIA systems:

bend run-cu examples/parallel_sum.bend -s

Run the full smoke set locally:

bend run-rs examples/parallel_sum.bend
bend run-rs examples/tree_map_reduce.bend
bend run-rs examples/bitonic_sort.bend
bend run-rs examples/fibonacci_modes.bend
bend run-rs examples/ai/tiny_mlp.bend
bend run-rs examples/ai/minimax_tree.bend
bend run-rs examples/ai/ensemble_vote.bend
bend run-c examples/parallel_sum.bend

Usage examples with output

Parallel tree sum

bend run-rs examples/parallel_sum.bend

Result: 8386560

Run the same source on the parallel CPU runtime:

bend run-c examples/parallel_sum.bend

Result: 8386560

With Bend's stats flag, you can compare runtime behavior. Numbers vary by machine, but the checksum should match:

bend run-rs examples/parallel_sum.bend -s
bend run-c examples/parallel_sum.bend -s

Example output from one local run:

Result: 8386560
- ITRS: 360395
- TIME: 0.01s
- MIPS: 28.19

Result: 8386560
- ITRS: 352203
- TIME: 0.01s
- MIPS: 55.38

Tree map/reduce

bend run-rs examples/tree_map_reduce.bend

Result: 2064384

Bitonic sort checksum

bend run-rs examples/bitonic_sort.bend

Result: 32640

Fibonacci modes

bend run-rs examples/fibonacci_modes.bend

Result: 13530

This example intentionally includes both a branching recursive Fibonacci and a tail-recursive Fibonacci. It is a reminder that more parallel structure does not automatically mean a better algorithm.

Tiny MLP inference

bend run-rs examples/ai/tiny_mlp.bend

Result: 426312

This runs a fixed integer neural-network forward pass over a synthetic batch and returns the batch-score checksum.

Minimax tree search

bend run-rs examples/ai/minimax_tree.bend

Result: 745

This evaluates a synthetic ternary game tree with alternating min/max layers.

Ensemble vote

bend run-rs examples/ai/ensemble_vote.bend

Result: 2024

This evaluates several independent decision stumps across a synthetic batch and returns the positive-class count.

Interpreting results

Bend currently uses 24-bit numeric types, so large integer examples may wrap modulo 2^24. The examples return checksums rather than printing arrays or images, because Bend's IO ecosystem is still young and benchmark-style programs are easier to compare across runtimes.

When comparing runtimes, look at Bend's -s output:

bend run-rs examples/bitonic_sort.bend -s
bend run-c  examples/bitonic_sort.bend -s
bend run-cu examples/bitonic_sort.bend -s

Good Bend candidates usually have lots of independent subproblems. Bad candidates are "helplessly sequential": each step depends on the exact result of the previous step.

Roadmap

matrix multiplication checksum
Mandelbrot / fractal checksum
N-body simulation checksum
prefix scan
larger AI inference/search kernels
benchmark result tables across CPU and GPU machines

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BendKernels

Why this exists

Examples

Install Bend

Run

Usage examples with output

Parallel tree sum

Tree map/reduce

Bitonic sort checksum

Fibonacci modes

Tiny MLP inference

Minimax tree search

Ensemble vote

Interpreting results

Roadmap

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

BendKernels

Why this exists

Examples

Install Bend

Run

Usage examples with output

Parallel tree sum

Tree map/reduce

Bitonic sort checksum

Fibonacci modes

Tiny MLP inference

Minimax tree search

Ensemble vote

Interpreting results

Roadmap

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages