The Transformer Robots are Here, Just a Different Kind (2024)

  • Edge 259: Our series about LLM reasoning dives into the fascinating tree-of-thoughts technique, including the original paper. We also review the Language Model Evaluation Harness framework for LLM evaluation.

  • Edge 260: We dive into Ghostbuster, UC Berkeley's model for detecting LLM-generated content.
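The tree-of-thoughts technique covered in Edge 259 can be illustrated with a toy search: partial "thoughts" are expanded into candidates, scored, and pruned to the best few at each depth. In the real technique an LLM both proposes and evaluates thoughts; here both roles are plain Python functions, and the target-sum task is purely illustrative.

```python
# Toy tree-of-thoughts search: expand partial thoughts, score them,
# and keep only the top-k at each depth (a beam over a thought tree).

def propose(thought):
    """Propose child thoughts: extend a partial sum with one more step."""
    total, steps = thought
    return [(total + s, steps + [s]) for s in (1, 2, 3)]

def score(thought, target):
    """Value function: partial sums closer to the target score higher."""
    total, _ = thought
    return -abs(target - total)

def tree_of_thoughts(target, depth=4, beam=2):
    frontier = [(0, [])]                    # root thought: empty plan
    best = frontier[0]
    for _ in range(depth):
        children = [c for t in frontier for c in propose(t)]
        children.sort(key=lambda t: score(t, target), reverse=True)
        frontier = children[:beam]          # prune to the best thoughts
        best = max([best] + frontier, key=lambda t: score(t, target))
    return best

total, steps = tree_of_thoughts(target=7)
```

The pruning step is what distinguishes this from exhaustive search: only promising branches of the thought tree are ever expanded further.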

You can subscribe below!

TheSequence is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Robotics has always been one of the most fertile grounds for adopting artificial intelligence (AI) techniques. With recent advancements in computer vision, language, and audio foundation models, we can expect to see a new generation of robotic applications that dazzle us. However, the challenges of building effective robotic solutions extend beyond AI and require deep mastery of the physics of an environment and incredibly effective coordination of perception and action. Typically, collecting the training datasets for such systems requires massive effort, but the advent of foundation models has drastically lowered the barrier to entry.

A few months ago, Google DeepMind unveiled the Robotic Transformer 2 (RT-2) models, which use language and computer vision to translate knowledge into robotic actions. Last week, DeepMind followed this research with three notable additions:

  1. AutoRT: A system that leverages vision-language models to deploy robots in completely new environments with minimal human supervision.

  2. SARA-RT: A method that converts RT-2 into a version that is 10% more accurate and 14% faster.

  3. RT-Trajectory: A video model for learning control policies for physical actions in robotic applications. The method takes a video and overlays a 2D sketch of the trajectory the robot should follow.

These three methods combine foundation models in image, language, and video to improve robotic applications. In particular, using foundation models for perception and its translation into action could accelerate robotics to levels we haven't seen before. The robot transformers are definitely on their way!
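The RT-Trajectory idea of overlaying a 2D action sketch on the input can be pictured with a toy rasterizer: trajectory waypoints in normalized image coordinates are marked onto an extra channel of the frame, so the policy "sees" the motion it should imitate. The grid size, waypoint format, and nearest-pixel marking are illustrative assumptions, not DeepMind's implementation.

```python
# Toy trajectory overlay: mark 2D waypoints onto a copy of a frame grid.

def overlay_trajectory(frame, waypoints):
    """Return a copy of `frame` (2D grid) with waypoints set to 1.0."""
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for x, y in waypoints:
        px = round(x * (w - 1))             # normalized [0, 1] -> pixel
        py = round(y * (h - 1))
        out[py][px] = 1.0
    return out

frame = [[0.0] * 8 for _ in range(8)]
path = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]  # a diagonal reach motion
hint = overlay_trajectory(frame, path)
```

In the actual system the overlaid sketch conditions a learned policy; here the overlay itself is the whole point of the sketch.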

📣 apply() Spring ‘24 Call for Speakers!


The next apply() is set for March 14 and we’re looking for speakers! apply() is the biggest virtual ML conference in the world, and is designed to bring together ML practitioners in one space to share best practices, development patterns, and emerging tooling.

Has your team built an ML platform? Pushed ML models to production? Learned valuable lessons on how to organize an ML or data science team? If so, we want to hear from you – submit your talk today!

Submit Talk

🔎 ML Research

Robotics with Foundation Models

Google DeepMind published the research and code behind AutoRT, SARA-RT and RT-Trajectory, three methods that leverage foundation models in robotic scenarios. The three techniques are part of the Robotics Transformer initiative, aimed at helping robots navigate environments and make quick decisions —> Read more.

Mobile ALOHA

Researchers from Stanford University unveiled Mobile ALOHA, a very impressive robotic application for object manipulation. The robot uses imitation learning to master a series of complex tasks following specific demonstrations. Watch the videos —> Read more.
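The imitation-learning setup behind systems like Mobile ALOHA can be reduced to behavior cloning: fit a policy to (observation, action) pairs from demonstrations by minimizing mean squared error. A 1-D linear policy trained with SGD stands in here for the real bimanual policy network; the demonstration data and learning rate are illustrative assumptions.

```python
# Minimal behavior-cloning sketch: regress actions onto observations.

def behavior_clone(demos, lr=0.1, epochs=200):
    """Fit action = w * obs + b to demonstration pairs via SGD on MSE."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for obs, act in demos:
            pred = w * obs + b
            err = pred - act               # gradient of 0.5 * err**2
            w -= lr * err * obs
            b -= lr * err
    return w, b

# Demonstrations of a simple skill: the action doubles the observation.
demos = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, b = behavior_clone(demos)
```

The appeal of the approach is exactly what the videos show: no reward engineering, just supervised learning on what the demonstrator did.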

GPU Split

Microsoft Research published a paper detailing Splitwise, an optimization technique for GPU utilization. Splitwise works by separating the prompt computation and token generation phases of LLM inference onto different machines —> Read more.
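The core Splitwise idea can be sketched as two workers: the compute-heavy prompt (prefill) phase builds the attention state once, then hands it off to a memory-bound token-generation (decode) phase. Real Splitwise transfers the KV cache between machines; in this sketch the two phases are just separate functions, and the list-based "KV cache" and placeholder tokens are illustrative assumptions.

```python
# Conceptual sketch of phase-split LLM inference (Splitwise-style).

def prefill_worker(prompt):
    """Prompt phase: process the whole prompt once, build a KV cache."""
    kv_cache = list(prompt)            # stands in for attention state
    return kv_cache

def decode_worker(kv_cache, n_tokens):
    """Generation phase: emit tokens one at a time, extending the cache."""
    out = []
    for i in range(n_tokens):
        token = f"tok{i}"              # placeholder for sampling a token
        kv_cache.append(token)         # decode grows the cache per step
        out.append(token)
    return out

cache = prefill_worker(["Hello", "world"])
tokens = decode_worker(cache, n_tokens=3)
```

Because the two phases stress hardware differently (prefill is compute-bound, decode is memory-bandwidth-bound), scheduling them on separate machines lets each pool be provisioned for its own bottleneck.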

LLM Augmented LLMs

Google DeepMind published a super interesting paper introducing Composition of Augmented Language Models (CALM), a method that augments the capabilities of LLMs with other LLMs. Specifically, CALM introduces cross-attention between models so that they can reuse knowledge representations —> Read more.
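The cross-attention composition at the heart of CALM can be pictured with a toy example: a hidden state from the anchor model attends over hidden states from the augmenting model, and the attended vector is added back as a residual. The single head, absence of learned projections, and 2-D vectors are simplifying assumptions for illustration.

```python
# Toy cross-attention between two models' hidden states (CALM-style).
import math

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]   # stable softmax
    z = sum(weights)
    weights = [w / z for w in weights]
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

anchor_state = [1.0, 0.0]                 # hidden state of the anchor LLM
aug_states = [[1.0, 0.0], [0.0, 1.0]]    # states from the augmenting LLM
attended = cross_attend(anchor_state, aug_states, aug_states)
composed = [a + c for a, c in zip(anchor_state, attended)]  # residual add
```

The residual composition means the anchor model keeps its own representation and merely borrows from the augmenting model where their states align.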

High Quality Text Embeddings Using Synthetic Data

Microsoft Research published a paper detailing a method for obtaining high quality text embeddings using only synthetic data and LLMs. More impressively, the method seems to require only about a thousand training steps instead of the billions of data pairs used to pretrain embedding models —> Read more.
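Training an embedding model on (query, positive) pairs, synthetic or otherwise, typically uses a contrastive InfoNCE objective with in-batch negatives: each query should score its own passage above every other passage in the batch. The toy 2-D embeddings and temperature below are illustrative assumptions, not the paper's setup.

```python
# InfoNCE with in-batch negatives, the usual contrastive embedding loss.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(queries, passages, tau=0.05):
    """Mean -log softmax of each query's own passage; lower is better."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / tau for p in passages]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]     # -log softmax at the positive
    return loss / len(queries)

good_q = [[1.0, 0.0], [0.0, 1.0]]
good_p = [[0.9, 0.1], [0.1, 0.9]]     # each query matches its own passage
bad_p = [[0.1, 0.9], [0.9, 0.1]]      # positives swapped
```

A well-aligned batch (`good_p`) yields a much lower loss than a misaligned one (`bad_p`), which is the signal driving the fine-tuning.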

OpenVoice

Researchers from decentralized AI platform MyShell published a paper detailing OpenVoice, a voice cloning method that only requires a short audio clip as input. OpenVoice enables super granular control over voice characteristics such as accent, rhythm, emotion, intonation and several others —> Read more.

🤖 Cool AI Tech Releases

CrewAI

A new open source framework for orchestrating autonomous agents —> Read more.

📡AI Radar


