This is an introduction to my work on Compute Infrastructure (the BATES team) at Nuro.

Scaling Autonomy in the Cloud

At Nuro, developing autonomous driving technology isn’t just about cutting-edge robotics and AI — it’s also about handling an immense amount of data efficiently and cost-effectively. Every day, we run millions of hours of simulations, data processing tasks, and evaluations on Google Cloud. To make this feasible, we’ve built an in-house system that orchestrates complex task dependencies and optimizes our compute cluster for maximum resource utilization.

The Challenge: Scaling Beyond Industry Standards

In the quest for autonomy, leveraging vast datasets is essential. However, existing industry solutions for job processing fall short when scaling to our needs or providing the rich features developers require for in-depth analysis. Broadly, these solutions fall into two categories:

1. Workflow Management Systems

Tools like Airflow, Jenkins, and Buildkite offer user-friendly interfaces for visualizing task execution — monitoring statuses, accessing logs, and more. But they’re typically designed to handle thousands of tasks per pipeline. In contrast, our validation processes often involve millions of tasks. Scaling such systems to our level presents insurmountable challenges. Traditional schedulers can’t handle thousands of tasks per second or manage Directed Acyclic Graphs (DAGs) with millions of nodes. Moreover, due to user interface constraints, identifying and interacting with individual tasks in massive DAGs becomes impractical.

2. Map-Reduce Frameworks

Platforms like Celery, Google Cloud BigQuery, DataFlow, Ray, and Pub/Sub excel at processing large datasets. However, they’re optimized for quick, stateless operations and lack detailed per-invocation tracking. Diagnosing issues in individual simulations is difficult without granular tracking, and without detailed insights, customizing resource utilization strategies is nearly impossible.

Fig 1. BATES manages DAGs of millions of scenes.

Introducing BATES: Our Scalable Solution

To overcome these hurdles, we developed BATES (Batch Task Execution System) — a robust platform capable of managing millions of tasks daily. Here’s how BATES transforms our workflow:

1. Hierarchical Task Management

In BATES, every execution unit is a Task, encapsulated as a protobuf message routed to a worker for execution. Tasks can dynamically generate sub-tasks, forming a hierarchical structure. Instead of predefining an entire job graph, tasks can spawn new tasks as needed, allowing for flexibility and scalability. This means the root task, or Job, is submitted by the user without the overhead of specifying every dependency upfront.
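The dynamic fan-out described above can be sketched in a few lines. This is a single-process stand-in, not Nuro's actual API: the `Task` dataclass stands in for the protobuf message, and `execute` stands in for the distributed worker pool; all names here are illustrative.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    """Stand-in for a protobuf Task message plus its handler."""
    payload: dict
    run: Callable[[dict], List["Task"]]  # returns sub-tasks spawned at runtime

def execute(root: Task) -> int:
    """Single-process stand-in for the distributed worker pool."""
    queue, completed = deque([root]), 0
    while queue:
        task = queue.popleft()
        queue.extend(task.run(task.payload))  # sub-tasks enter the queue dynamically
        completed += 1
    return completed

leaf = lambda payload: []  # a task that spawns nothing

def scene_task(i: int) -> Task:
    # Each scene task spawns one metrics sub-task when it runs.
    return Task({"scene_id": i}, lambda p: [Task({"metrics_for": i}, leaf)])

# The user submits only the root Job; the graph unfolds during execution.
job = Task({"job": "validate-release"}, lambda p: [scene_task(i) for i in range(3)])
print(execute(job))  # 1 root + 3 scene tasks + 3 metrics tasks = 7
```

The key property this models is that no complete DAG ever exists up front: each task's handler decides at runtime what work to enqueue next.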

2. Advanced Orchestration and Tracking

Our BATES server leverages a combination of an OLTP database and Redis to monitor task statuses. We’ve implemented a custom message queue that automatically assigns and adjusts task priorities based on job size, start time, and real-time factors like new job submissions or user reprioritizations. Users can attach metadata to tasks, which we export to observability platforms like Prometheus and Google Cloud BigQuery for analysis and alerting.
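The priority logic might look roughly like the sketch below. The real system keeps this state in Redis and an OLTP database, and its actual scoring policy is internal; here an in-memory min-heap and a hypothetical formula (effective submission time, penalized by job size and pulled forward by user boosts) stand in for both.

```python
import heapq
import itertools

class TaskQueue:
    """Sketch of a BATES-like priority queue. The real system is backed by
    Redis; this scoring formula is illustrative, not Nuro's actual policy."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal scores

    @staticmethod
    def score(job_size: int, submit_ts: float, user_boost: float = 0.0) -> float:
        # Effective submission time: larger jobs are pushed back in line,
        # user reprioritizations pull tasks forward.
        return submit_ts + 0.01 * job_size - user_boost

    def push(self, task_id: str, job_size: int, submit_ts: float,
             user_boost: float = 0.0) -> None:
        s = self.score(job_size, submit_ts, user_boost)
        heapq.heappush(self._heap, (s, next(self._counter), task_id))

    def pop(self) -> str:
        # Lowest score runs first (heapq is a min-heap).
        return heapq.heappop(self._heap)[2]

q = TaskQueue()
q.push("big-job/task-1", job_size=1_000_000, submit_ts=1000.0)
q.push("small-job/task-1", job_size=100, submit_ts=1000.0)
print(q.pop())  # small-job/task-1: small jobs submitted at the same time win
```

A real deployment would recompute or adjust scores as new jobs arrive and as users reprioritize, which is why a mutable store like a Redis sorted set fits better than a static heap.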

Fig 2. Job scheduling optimizes job turnaround time based on SLA and size.

Simplifying Execution with Generic Workers