Unlocking the Potential of Computer Vision and Robotics

Human-Robot Interaction

Author: Taisei Hanyu | Major: Computer Science | Semester: Spring 2025

Hello, my name is Taisei Hanyu, and I am a senior Computer Science major at the University of Arkansas. I have been conducting research in the Artificial Intelligence and Computer Vision Lab under the guidance of Dr. Ngan Le. My recent work focuses on enabling efficient and interpretable robotic manipulation using vision-language-action models. Specifically, I contributed to the development and evaluation of SlotVLA, a novel framework that uses relation-centric visual representations to help robots reason about their environment and perform complex tasks using only a few semantically meaningful visual tokens.

When I first joined the lab, I had no formal experience with robotics. However, with the support of Dr. Le and the resources provided by the Honors College Research Grant, I was able to gain foundational knowledge in robotics and apply my skills in computer vision and machine learning to embodied AI systems. Throughout the project, I collaborated with graduate student Kashu Yamazaki and external collaborator Nhat Chung, learning a great deal about model implementation, debugging, and experimental design. I played a key role in integrating the relation-centric visual tokenizer with a large language model for action decoding and ran experiments using the LIBERO-Goal benchmark to assess the system’s performance across both single-view and multi-view scenarios.

Because no one in the lab was focused on robotic hardware, I was responsible for building and configuring the entire physical platform from the ground up. One of the most difficult parts of the project was developing a unified control system that could reliably manage the robot arm and gripper across a variety of hardware setups, communication protocols, and software libraries. My prior knowledge of robotics was limited, which made this task particularly challenging. However, instead of relying on trial and error, I approached the problem by studying the underlying control theory and system design principles. This effort resulted in a reliable pipeline that we used to collect data for training and evaluating our framework.
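To illustrate the kind of abstraction this required, the sketch below shows a minimal, hypothetical unified control interface in Python. The class and method names (ArmBackend, UnifiedController, step, and so on) are illustrative assumptions for this write-up rather than the code I actually wrote; the idea is simply that every hardware- or protocol-specific driver implements one common interface, so the data-collection code never has to change when the hardware does.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class JointCommand:
    """One control step: target joint positions (radians) and a gripper opening in [0, 1]."""
    joint_positions: Sequence[float]
    gripper_opening: float


class ArmBackend(ABC):
    """Interface that each hardware- or protocol-specific driver implements."""

    @abstractmethod
    def connect(self) -> None: ...

    @abstractmethod
    def send_joint_command(self, positions: Sequence[float]) -> None: ...

    @abstractmethod
    def set_gripper(self, opening: float) -> None: ...

    @abstractmethod
    def read_joint_state(self) -> List[float]: ...


class SimulatedBackend(ArmBackend):
    """Stand-in driver that just echoes commands; a real backend would wrap a serial link, ROS, or a vendor SDK."""

    def __init__(self, num_joints: int = 7):
        self._state = [0.0] * num_joints

    def connect(self) -> None:
        pass  # nothing to open for the simulated device

    def send_joint_command(self, positions: Sequence[float]) -> None:
        self._state = list(positions)

    def set_gripper(self, opening: float) -> None:
        self._gripper = max(0.0, min(1.0, opening))

    def read_joint_state(self) -> List[float]:
        return list(self._state)


class UnifiedController:
    """Single entry point for teleoperation and data logging.

    The rest of the pipeline only talks to this class, so swapping the
    underlying driver does not require touching the data-collection code.
    """

    def __init__(self, backend: ArmBackend):
        self.backend = backend
        self.backend.connect()

    def step(self, command: JointCommand) -> List[float]:
        """Send one command and return the resulting joint state for logging."""
        self.backend.send_joint_command(command.joint_positions)
        self.backend.set_gripper(command.gripper_opening)
        return self.backend.read_joint_state()


if __name__ == "__main__":
    controller = UnifiedController(SimulatedBackend())
    state = controller.step(JointCommand(joint_positions=[0.1] * 7, gripper_opening=0.5))
    print(state)
```

The design choice sketched here, a single narrow interface between the control code and the hardware drivers, is what made the pipeline manageable across different setups, protocols, and libraries.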

This project provided me with practical experience working with real-world robotics datasets and multimodal learning frameworks. I learned how to structure a complex research system, design evaluation protocols, and interpret quantitative results. One of the most rewarding parts of the experience was understanding how a few structured visual tokens, when properly aligned with language instructions, can improve robot behavior and efficiency. I also contributed to writing and revising our technical paper, which strengthened my ability to communicate ideas clearly and precisely.

Dr. Ngan Le and Kashu Yamazaki were instrumental in supporting my research. Dr. Le always treated me as a full contributor, meeting with me regularly to discuss research direction and give detailed, constructive feedback. Kashu provided extensive technical guidance, especially during implementation, and often suggested ways to make my system flexible and future-proof. Their support was critical to pushing this project forward.

Moving forward, I plan to continue exploring the intersection of vision, language, and robotics. This project has deepened my interest in pursuing graduate studies and conducting research on how robots can interact with dynamic, unstructured environments using minimal yet meaningful information. I am grateful for the opportunity to be part of a project that not only expanded my technical capabilities but also clarified my long-term research goals.