SLMs

Small Language Model Tool-use Accuracy and Throughput Optimization for the Jetson AGX Orin

A brief survey and motivation for optimizing low-latency tool-use throughput on the Jetson AGX Orin.
Bence Danko
Last updated February 14, 2026 at 6:00 PM
Index Terms: small language models, mixture of experts, tool calling, Jetson AGX Orin, prompt engineering
Appendices: Appendix A

Abstract

The deployment of agentic Artificial Intelligence on edge hardware is currently constrained by the trade-off between computational latency and reasoning capability. This preliminary survey proposes a comparative analysis of Small Language Models (SLMs), specifically focusing on recent sparse Mixture-of-Experts (MoE) architectures, to optimize tool-use throughput and accuracy on the NVIDIA Jetson AGX Orin. We examine the performance of emerging SLMs—including gpt-oss-20b, Qwen3-Coder-Next, and Nemotron 3 Nano—to determine their viability in resource-constrained, low-latency environments. The proposed methodology integrates QLoRA fine-tuning with rigorous ablations of prompt engineering frameworks (ReAct vs. ReflAct) and token-efficient serialization formats, contrasting standard JSON against Token-Oriented Object Notation (Toon). Performance is evaluated against the Berkeley Function Calling Leaderboard (BFCL) and tau2-bench, with specific attention to Time-to-First-Token (TTFT), decode rates, and power efficiency across the Orin’s variable thermal envelopes. This work aims to establish a methodology for maximizing agentic tool-calling density on consumer-grade embedded hardware.

Summary


We intend to optimize and compare several small language models, specifically recently developed sparse mixture-of-experts models, on the Jetson AGX Orin [1].

We will compare baseline performance on BFCL [5] and tau2-bench [6], two benchmarks for agentic tool calling. We will perform ablations on prompting strategies, and QLoRA fine-tuning on ToolMind [7].

Dataset


ToolMind [7] consists of 160k synthetic data instances generated using over 20k tools, plus roughly 200k instances augmented from open-source datasets.

The original synthetic portion contains 160k high-quality instances created through agent simulations, with judge pruning retained for higher-quality samples [7]. The remaining ~200k samples come from open-source projects [8], [9], [10], [11], [12], [13], [14]. All are standardized to a common format.

ToolMind synthesis process. Source: ToolMind [7].

In total, there are 368,611 tool-calling chain samples. See Appendix A for a sample. Data entries include:

Background and Literature Review


Despite the engineering overhead of sparse MoE training and tuning, recent work has argued that small language models (SLMs) are essential for agentic and tool-oriented workloads, citing advantages in latency, operational cost, computational requirements, and task performance [15]. In this work we hope to demonstrate the tool-use proficiency of small, sparse MoE models on efficient hardware, which would support these claims. We also hope to test and demonstrate efficient prompting and tool-use frameworks that optimize token use.

Sparse Mixture of Experts Fine Tuning

The Mixture-of-Experts (MoE) architecture has recently undergone a revival in the domain of Large Language Models (LLMs), with sparse MoEs in particular rising to mainstream popularity [16]. However, training and fine-tuning large-scale MoE models for specific domains and tasks is widely documented to be difficult, raising issues such as router collapse, training divergence, and routing inefficiencies [17], [18], [19], [20], [21].

Recent developments in fine-tuning frameworks such as Unsloth directly target sparse MoE fine-tuning [22], though stability is primarily achieved by simply freezing the router. Techniques such as selective PEFT expert tuning have also emerged [23]. Large-scale testing and ablations of these techniques on modern MoE architectures, running on capable edge hardware, remain an open area for contribution and experimentation.
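As a concrete illustration of the router-freezing strategy mentioned above, the sketch below partitions a model's parameter names into trainable experts and frozen routing layers. The layer names are hypothetical; real MoE checkpoints use names along the lines of `experts.0.w1` or `moe.gate`, which vary by architecture.

```python
# Sketch: decide which MoE parameters to tune vs. freeze.
# Name substrings below are illustrative, not tied to any specific model.

ROUTER_KEYS = ("router", "gate")   # common substrings for routing layers
EXPERT_KEYS = ("experts",)         # common substring for expert FFN weights

def split_trainable(param_names):
    """Return (trainable, frozen) parameter-name lists.

    Experts are tuned; routers are frozen to avoid destabilizing routing
    during fine-tuning, as frameworks like Unsloth do by default.
    """
    trainable, frozen = [], []
    for name in param_names:
        if any(k in name for k in ROUTER_KEYS):
            frozen.append(name)          # never update routing logits
        elif any(k in name for k in EXPERT_KEYS):
            trainable.append(name)       # LoRA/PEFT targets live here
        else:
            frozen.append(name)          # attention, embeddings, etc.
    return trainable, frozen

names = [
    "layers.0.attn.q_proj",
    "layers.0.moe.gate.weight",
    "layers.0.moe.experts.0.w1",
    "layers.0.moe.experts.1.w1",
]
trainable, frozen = split_trainable(names)
print(trainable)  # ['layers.0.moe.experts.0.w1', 'layers.0.moe.experts.1.w1']
```

In a PEFT setup, the trainable list would become the adapter's target modules, while everything in the frozen list keeps `requires_grad=False`.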

Prompt Engineering

Prompt engineering for agentic tool use has been documented across planning, tool-selection, calling, and response-generation workflow stages [24]. ReAct [25] was foundational in introducing a widely accepted paradigm for contextual engineering, but research continues into alternatives such as ReflAct, which grounds reasoning in reflection on the agent's state relative to its goal [26]. In our experiments, we hope to optimize not just accuracy and reasoning quality, but also the efficiency of those reasoning tokens, reducing time to action and invocation.

ReAct versus ReflAct. Source: ReflAct [26].
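To make the ReAct control flow concrete, here is a minimal sketch of the Thought → Action → Observation loop. The model and the tool registry are stubs (a real agent would call an LLM and parse its output); only the loop structure is the point.

```python
# Minimal ReAct-style loop: the model alternates reasoning ("Thought")
# with tool calls ("Action"), and each tool result is appended as an
# "Observation" before the next step. Model and tools are stand-ins.

def fake_model(transcript):
    """Stand-in for an LLM: returns (thought, action, arg)."""
    if "Observation: 22.0" in transcript:
        return ("I have the reading; answer directly.", "finish", "22.0 C")
    return ("I need the sensor value first.", "read_sensor", "temp0")

TOOLS = {"read_sensor": lambda sensor_id: 22.0}  # hypothetical tool

def react_loop(task, max_steps=5):
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        thought, action, arg = fake_model(transcript)
        transcript += f"\nThought: {thought}"
        if action == "finish":
            transcript += f"\nFinal Answer: {arg}"
            return arg, transcript
        obs = TOOLS[action](arg)
        transcript += f"\nAction: {action}({arg})\nObservation: {obs}"
    return None, transcript

answer, _ = react_loop("Report the current temperature.")
print(answer)  # 22.0 C
```

ReflAct would modify this loop by replacing the free-form Thought with a reflection on the agent's current state relative to the stated goal before each action [26].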

MedReason [27] addresses efficient reasoning on smaller models. By constructing Tree-of-Thought (ToT) trees offline and pruning for shortest-path reasonings, the authors created a high-quality reasoning dataset. Pruning for the shortest path allows training on straightforward reasoning chains while still offering explainability [27], and still indirectly benefits from the exploratory nature of ToT outputs. The synthetic ToolMind data used a similarly spirited, offline chain-preparation approach: random tool walks are first generated and explained by an LLM in an agent role; these are presented to an LLM in a user-agent role that determines successful goal completion; a second LLM-judge evaluation then prunes for coherent tool-chaining paths and associated reasoning [7]. One limitation is that the ToolMind data is not specifically curated for the most straightforward or efficient tool-calling pathways, relying mainly on synthetic pass-fail judgements by the two LLM judges.

ToolMind makes tool calls using JSON [7]. There is a lack of experimentation with possibly superior tool-calling paradigms, such as Natural Language Tools [28] and TOON (Token-Oriented Object Notation), a notation that can use up to ~40% fewer tokens than the equivalent JSON representation [29]. There is yet to be significant work demonstrating ablations of these techniques.
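For intuition on where the token savings come from, consider the same payload in JSON and in TOON. The TOON rendering below is sketched from the format's spec [29]: uniform object arrays collapse into a declared length, a field header, and CSV-style rows, eliding repeated keys, braces, and quotes.

```text
JSON:
{"users": [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]}

TOON equivalent:
users[2]{id,name}:
  1,Alice
  2,Bob
```

The savings grow with the number of rows, since each JSON object repeats every key while TOON states the keys once.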

Hardware: AGX Orin

The AGX Orin's GPU is based on the Ampere architecture [1], which offers strong performance on INT8 and INT4 operations and quantization methods. However, it is important to consider that FP4 is likely a superior alternative for future researchers [30], and that our main contribution lies in data preparation that transfers to such superior consumer hardware. The AGX Orin delivers up to 275 TOPS (INT8) at its highest power setting, and despite its 64 GB of unified memory, quantization is necessary for low-latency performance.
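A back-of-envelope memory estimate shows why quantization is necessary. The sketch below assumes weights dominate the footprint and ignores KV cache and activation overhead; the 20B parameter count is meant to stand in for a gpt-oss-20b-class model.

```python
# Rough weight-memory footprint of a 20B-parameter model at different
# precisions, to compare against the Orin's 64 GB unified memory
# (which is shared with the OS, KV cache, and activations).

PARAMS = 20e9
GB = 1024 ** 3

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    size_gb = PARAMS * bytes_per_param / GB
    print(f"{name}: {size_gb:.1f} GB")
```

At FP16 the weights alone consume well over half of the unified memory, while INT4 leaves ample headroom for KV cache and concurrent workloads.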

Technical Approach


Prompt Engineering

We intend to explore ablations of tool-calling techniques for our various models, including:

In addition, we intend to test various agentic frameworks for tool calling and compare their efficiency and benchmark accuracy:

To implement ReflAct, we can further synthesize ToolMind. Though the dataset's valid tool-calling chains are invaluable as-is, we can augment them to include state reflection. We can use frontier models to prepare state-awareness annotations, potentially yielding the performance gains indicated by the ReflAct framework.
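One way to produce such annotations is to prompt a frontier model with each existing trace plus the task goal, asking for a goal-state reflection before every tool call. The prompt builder below is a hypothetical sketch; the field names and wording are ours, not ToolMind's schema.

```python
# Sketch: build an augmentation prompt asking a frontier model to add a
# ReflAct-style reflection (agent state relative to the goal) before each
# tool call in an existing tool-use trace. Field names are illustrative.

def build_reflection_prompt(goal, steps):
    """steps: list of dicts like {"tool": ..., "args": ..., "result": ...}."""
    lines = [
        "You are annotating a tool-use trace with goal-state reflections.",
        f"Goal: {goal}",
        "Before each tool call below, write one sentence describing the",
        "agent's current state relative to the goal, then keep the call.",
        "",
    ]
    for i, step in enumerate(steps, 1):
        lines.append(f"Step {i}: call {step['tool']}({step['args']}) "
                     f"-> {step['result']}")
    return "\n".join(lines)

prompt = build_reflection_prompt(
    goal="Book the cheapest flight to SFO",
    steps=[
        {"tool": "search_flights", "args": "dest=SFO", "result": "3 options"},
        {"tool": "book", "args": "flight_id=2", "result": "confirmed"},
    ],
)
print(prompt.splitlines()[1])  # Goal: Book the cheapest flight to SFO
```

The model's responses would then be judge-filtered, mirroring ToolMind's own pass-fail pruning, before being merged back into the training set.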

Fine-Tuning Techniques

Frameworks like Unsloth have developed specialized MoE fine-tuning support, most recently accelerating LoRA training times [22]. We will take advantage of these frameworks in our fine-tuning process. However, most of the effort in quality tuning comes from ensuring clean, high-quality data representations.

Performance Evaluation


ToolMind itself was evaluated on BFCL [5], tau-bench [31], and tau2-bench [6]. We will evaluate on the same benchmarks in order to establish consistent baselines comparable with our own developments.

We hope to construct the following ablations to demonstrate performance on the AGX Orin:

Baseline Metrics Template

| Model | TTFT¹ (ms) | Prefill² (tok/sec) | Decode³ (tok/sec) | Accuracy |
|-------|------------|--------------------|-------------------|----------|
| x     | x          | x                  | x                 | x        |
Baseline metrics to run for the base MoE models.

These will be carried out on 15W, 30W, 45W and 60W modes.
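The per-request metrics in the table can be computed from any streaming generation API. The harness below sketches the computation over a generic token iterator; the simulated generator is a stand-in for a real model's token stream.

```python
import time

# Compute TTFT and decode rate from a streaming token iterator:
# TTFT is the delay to the first token; decode rate is tokens per
# second over the remaining (post-first-token) stream.

def measure_stream(token_iter):
    start = time.perf_counter()
    first_token_at = None
    n_tokens = 0
    for _ in token_iter:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now
        n_tokens += 1
    end = time.perf_counter()
    ttft_ms = (first_token_at - start) * 1000
    decode_tps = (n_tokens - 1) / (end - first_token_at) if n_tokens > 1 else 0.0
    return ttft_ms, decode_tps

def simulated_stream(n=20, delay=0.001):
    for i in range(n):
        time.sleep(delay)   # stand-in for per-token decode latency
        yield f"tok{i}"

ttft_ms, decode_tps = measure_stream(simulated_stream())
print(f"TTFT: {ttft_ms:.1f} ms, decode: {decode_tps:.0f} tok/s")
```

For the power-mode ablations, the same harness would be rerun after switching modes (e.g. via `nvpmodel` on Jetson platforms), holding the prompt and generation length fixed.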

References

[1]
NVIDIA, “NVIDIA Jetson AGX Orin Technical Brief.” NVIDIA Corporation, 2022. [Online]. Available: https://www.nvidia.com/content/dam/en-zz/Solutions/gtcf21/jetson-orin/nvidia-jetson-agx-orin-technical-brief.pdf
[2]
O. S. A. Team, “gpt-oss-120b and gpt-oss-20b Release Card,” arXiv preprint arXiv:2508.10925, 2025, [Online]. Available: https://arxiv.org/abs/2508.10925
[3]
Q. Team, “Qwen3-Coder-Next Technical Report.” Alibaba Group, 2026. [Online]. Available: https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_coder_next_tech_report.pdf
[4]
NVIDIA, “NVIDIA-Nemotron-3-Nano-Technical-Report.” NVIDIA Research, 2025. [Online]. Available: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf
[5]
G. O. F. Team, “BFCL v4: Berkeley Function Calling Leaderboard with Web Search.” 2024. [Online]. Available: https://gorilla.cs.berkeley.edu/blogs/15_bfcl_v4_web_search.html
[6]
S. Research, “Tau2-Bench: Next-Generation Tool-Use Benchmarking.” 2025. Accessed: Feb. 25, 2026. [Online]. Available: https://github.com/sierra-research/tau2-bench
[7]
T. Authors, “ToolMind: A Comprehensive Benchmark for Tool-use in LLMs,” arXiv preprint arXiv:2511.15718, 2025, [Online]. Available: https://arxiv.org/abs/2511.15718
[8]
Apig. Authors, “APIGen: Automated Pipeline for Generating High-Quality Datasets for Tool-Use,” arXiv preprint arXiv:2406.18518, 2024, [Online]. Available: https://arxiv.org/abs/2406.18518
[9]
Glaive AI, “Glaive Function Calling V2.” Accessed: Feb. 25, 2026. [Online]. Available: https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2
[10]
C. Zhang and others, “ToolACE: Deterministic Tool Use with Trace-Level Post-Training,” arXiv preprint arXiv:2409.00920, 2024, [Online]. Available: https://arxiv.org/abs/2409.00920
[11]
N. Author and others, “When2Call: Optimizing Tool Usage in Large Language Models,” arXiv preprint arXiv:2504.18851, 2025, [Online]. Available: https://arxiv.org/abs/2504.18851
[12]
C. Yuan and others, “BUTTONInstruct: Bounding User-defined Tasks with Objective Navigation.” 2024. [Online]. Available: https://arxiv.org/abs/2410.12952
[13]
L. Zheng and others, “APIGen-MT-5k: A multi-turn API execution dataset.” 2025. [Online]. Available: https://arxiv.org/abs/2504.03601
[14]
T. Zhu and others, “Tau-bench: A Benchmark for Tool-learning Agents in Real-world Scenarios.” 2024. [Online]. Available: https://arxiv.org/abs/2406.12045
[15]
N. Research, “Small Language Models (SLMs) are the Future,” arXiv preprint arXiv:2506.02153, 2025, [Online]. Available: https://arxiv.org/abs/2506.02153
[16]
W. Cai, J. Jiang, F. Wang, J. Tang, S. Kim, and J. Huang, “A Survey on Mixture of Experts in Large Language Models,” IEEE Transactions on Knowledge and Data Engineering, 2025, [Online]. Available: https://arxiv.org/pdf/2407.06204
[17]
W. Fedus, B. Zoph, and N. Shazeer, “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022, [Online]. Available: https://arxiv.org/abs/2101.03961
[18]
B. Zoph et al., “ST-MoE: Designing Stable and Transferable Sparse Expert Models.” 2022. [Online]. Available: https://arxiv.org/abs/2202.08906
[19]
N. Shazeer et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” arXiv preprint arXiv:1701.06538, 2017, [Online]. Available: https://arxiv.org/abs/1701.06538
[20]
D. Lepikhin et al., “GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding,” arXiv preprint arXiv:2006.16668, 2020, [Online]. Available: https://arxiv.org/abs/2006.16668
[21]
C. Hwang et al., “Tutel: Adaptive Mixture-of-Experts at Scale,” arXiv preprint arXiv:2206.03382, 2022, [Online]. Available: https://arxiv.org/abs/2206.03382
[22]
U. AI, “Unsloth Specialized MoE Fine-Tuning.” Accessed: Feb. 25, 2026. [Online]. Available: https://unsloth.ai/docs/new/faster-moe
[23]
Y. T. et al., “Exploring Expert Concentration for Parameter-efficient Fine-tuning of Mixture-of-Expert LLMs,” OpenReview, 2025, [Online]. Available: https://openreview.net/forum?id=zBgjWTWgCh
[24]
C. Qu et al., “Tool Learning with Large Language Models: A Survey,” arXiv preprint arXiv:2405.17935, 2024, [Online]. Available: https://arxiv.org/abs/2405.17935
[25]
S. Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” in International Conference on Learning Representations (ICLR), 2023. [Online]. Available: https://arxiv.org/abs/2210.03629
[26]
J. Kim et al., “ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection,” arXiv preprint arXiv:2505.15182, 2025, [Online]. Available: https://arxiv.org/abs/2505.15182
[27]
M. Authors, “MedReason: Reasoning-focused Medical Large Language Models,” arXiv preprint arXiv:2504.00993, 2025, [Online]. Available: https://arxiv.org/abs/2504.00993
[28]
R. T. Johnson, M. D. Pain, and J. D. West, “Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents,” arXiv preprint arXiv:2510.14453, 2025, [Online]. Available: https://arxiv.org/abs/2510.14453
[29]
TOON Format Contributors, “Token-Oriented Object Notation (TOON).” 2025. [Online]. Available: https://github.com/toon-format/toon
[30]
onekq, “NVFP4 vs INT4.” Hugging Face Blog. [Online]. Available: https://huggingface.co/blog/onekq/nvfp4-int4
[31]
S. Research, “Tau-Bench: A Benchmark for Tool-Use Agents.” 2024. Accessed: Feb. 25, 2026. [Online]. Available: https://github.com/sierra-research/tau-bench

Footnotes

  1. Time to first token.

  2. How fast the model can process (prefill) the prompt context.

  3. How fast the model can generate new tokens.