technology · 4h ago
Unpacking the deceptively simple science of tokenomics
- The economics of AI inference hinge on tokens per watt, not just raw GPU count.
- Goodput (useful throughput that actually meets latency targets) depends on hardware, software, and model choice.
- Disaggregated compute and rack-scale architectures improve throughput at scale.
- Mixture-of-experts models and high-speed fabrics reduce latency and boost efficiency.
- Rack-scale systems like the GB300 NVL72 offer higher interactivity with sustained throughput.
- Software matters: TensorRT-LLM often outperforms open-source inference engines in specific configurations.
- Quantization to FP4/FP8 shrinks weight precision and memory footprint but can degrade accuracy without careful tuning.
- Open-weight models are converging with closed models due to tuning tools and industry pressure.
- The token market is a race to the bottom on price, with price and quality diverging sharply by provider.
- AI datacenters are described as factories whose profitability is defined by power in and tokens out.
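The "power in, tokens out" framing above can be made concrete with a back-of-envelope sketch. Every number and function name below is an illustrative assumption, not a figure from the article: it shows how tokens per watt drives electricity cost per million tokens, and how goodput differs from raw throughput by counting only requests that meet a latency target.

```python
# Back-of-envelope "tokenomics": power in, tokens out.
# All numbers are illustrative assumptions, not measured figures.

def cost_per_million_tokens(rack_power_kw, tokens_per_second,
                            electricity_usd_per_kwh=0.08):
    """Electricity cost (USD) to generate one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    kwh_per_million = rack_power_kw * (1_000_000 / tokens_per_hour)
    return kwh_per_million * electricity_usd_per_kwh

def goodput(latencies_ms, slo_ms, tokens_per_request):
    """Tokens/sec from requests that met the latency target (SLO)."""
    met_slo = [t for t in latencies_ms if t <= slo_ms]
    total_time_s = sum(latencies_ms) / 1000
    return len(met_slo) * tokens_per_request / total_time_s

# Same rack power, three times the tokens per watt:
baseline = cost_per_million_tokens(rack_power_kw=120, tokens_per_second=50_000)
denser   = cost_per_million_tokens(rack_power_kw=120, tokens_per_second=150_000)
print(f"${baseline:.4f} vs ${denser:.4f} per million tokens")
```

At identical rack power, the denser configuration cuts the per-token electricity cost by the same 3x factor, which is why tokens per watt, not GPU count, is the headline metric.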
