
2025 | OriginalPaper | Chapter

ControlLLM: Augment Language Models with Tools by Searching on Graphs

Authors: Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang

Published in: Computer Vision – ECCV 2024

Publisher: Springer Nature Switzerland


Abstract

We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks. Despite the remarkable performance of LLMs, they still struggle with tool invocation due to ambiguous user prompts, inaccurate tool selection and mismatched input arguments. To overcome these challenges, our framework comprises three key components: (1) a task decomposer that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches the optimal solution path on a pre-built tool graph, which specifies the parameter and dependency relations among different tools; and (3) an execution engine with a rich toolbox that interprets the solution path and runs the tools efficiently on different computational devices. We evaluate our framework on diverse tasks involving image, audio, and video processing, demonstrating its superior accuracy, efficiency, and versatility compared to existing methods. The code is available at https://github.com/OpenGVLab/ControlLLM.
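The core idea of the Thoughts-on-Graph paradigm — searching for a solution path on a graph whose nodes are tools connected by input/output dependencies — can be illustrated with a minimal breadth-first search sketch. The toolbox, tool names, and resource types below are hypothetical stand-ins, not the paper's actual implementation, which additionally scores and ranks candidate paths:

```python
from collections import deque

# Hypothetical toolbox: each tool consumes a set of resource types and
# produces one output type. Names and signatures are illustrative only.
TOOLS = {
    "image_captioning": {"in": {"image"}, "out": "text"},
    "text_to_speech":   {"in": {"text"}, "out": "audio"},
    "image_to_video":   {"in": {"image"}, "out": "video"},
    "video_captioning": {"in": {"video"}, "out": "text"},
}

def search_solution_path(available, goal, max_depth=4):
    """BFS over the tool graph: starting from the resource types the
    user provides, repeatedly apply any tool whose inputs are already
    satisfied, until a state containing the goal type is reached.
    Returns the shortest tool sequence found, or None."""
    start = frozenset(available)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        resources, path = queue.popleft()
        if goal in resources:
            return path
        if len(path) >= max_depth:
            continue
        for name, spec in TOOLS.items():
            if spec["in"] <= resources:  # all inputs available
                nxt = frozenset(resources | {spec["out"]})
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None

# e.g. "describe this image out loud": image -> text -> audio
print(search_solution_path({"image"}, "audio"))
```

Because the search operates on typed dependencies rather than free-form LLM output, a returned path is executable by construction: every tool on it has its input arguments satisfied by the user's resources or by an earlier tool's output.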


Metadata
Title
ControlLLM: Augment Language Models with Tools by Searching on Graphs
Authors
Zhaoyang Liu
Zeqiang Lai
Zhangwei Gao
Erfei Cui
Ziheng Li
Xizhou Zhu
Lewei Lu
Qifeng Chen
Yu Qiao
Jifeng Dai
Wenhai Wang
Copyright Year
2025
DOI
https://doi.org/10.1007/978-3-031-73254-6_6
