A classical modular autonomous driving system typically consists of perception, prediction, planning, and control. Until around 2023, AI (artificial intelligence) or ML (machine learning) primarily enhanced perception in most mass-production autonomous driving systems, with its influence diminishing in downstream components. In stark contrast to the low integration of AI in the planning stack, end-to-end perception systems (such as the BEV, or bird's-eye-view, perception pipeline) have been deployed in mass-production vehicles.
There are multiple reasons for this. A classical stack based on a human-crafted framework is more explainable and can be iterated faster to fix field test issues (within hours) compared to machine learning-driven features (which could take days or weeks). However, it does not make sense to let readily available human driving data sit idle. Moreover, increasing computing power is more scalable than expanding the engineering team.
Fortunately, there has been a strong trend in both academia and industry to change this situation. First, downstream modules are becoming increasingly data-driven and may also be integrated via different interfaces, such as the one proposed in CVPR 2023’s best paper, UniAD. Moreover, driven by the ever-growing wave of generative AI, a single unified vision-language-action (VLA) model shows great potential for handling complex robotics tasks (RT-2 in academia, TeslaBot and 1X in industry) and autonomous driving (GAIA-1, DriveVLM in academia, and Wayve AI Driver, Tesla FSD in industry). This brings the toolsets of AI and data-driven development from the perception stack to the planning stack.
This blog post aims to introduce the problem settings, existing methodologies, and challenges of the planning stack, in the form of a crash course for perception engineers. As a perception engineer, I finally had some time over the past couple of weeks to systematically learn the classical planning stack, and I would like to share what I learned. I will also share my thoughts on how AI can help from the perspective of an AI practitioner.
The intended audience for this post is AI practitioners who work in the field of autonomous driving, in particular, perception engineers.
The article is a bit long (11,100 words), and the table of contents below will most likely help those who want to do quick Ctrl+F searches with the keywords.
Table of Contents (ToC)
Why learn planning?
What is planning?
The problem formulation
The Glossary of Planning
Behavior Planning
Frenet vs Cartesian systems
Classical tools-the troika of planning
Searching
Sampling
Optimization
Industry practices of planning
Path-speed decoupled planning
Joint spatiotemporal planning
Decision making
What and why?
MDP and POMDP
Value iteration and Policy iteration
AlphaGo and MCTS-when nets meet trees
MPDM (and successors) in autonomous driving
Industry practices of decision making
Trees
No trees
Self-Reflections
Why NN in planning?
What about e2e NN planners?
Can we do without prediction?
Can we do with just nets but no trees?
Can we use LLMs to make decisions?
The trend of evolution
Why learn planning?
This brings us to an interesting question: why learn planning, especially the classical stack, in the era of AI?
From a problem-solving perspective, understanding your customers’ challenges better will enable you, as a perception engineer, to serve your downstream customers more effectively, even if your main focus remains on perception work.
Machine learning is a tool, not a solution. The most efficient way to solve problems is to combine new tools with domain knowledge, especially domain knowledge that has a solid mathematical formulation. Learning methods inspired by domain knowledge are likely to be more data-efficient. As planning transitions from rule-based to ML-based systems, even with early prototypes and products of end-to-end systems hitting the road, there is a need for engineers who deeply understand both the fundamentals of planning and machine learning. Despite these changes, classical and learning methods will likely continue to coexist for a considerable period, perhaps shifting from an 8:2 to a 2:8 ratio. It is almost essential for engineers working in this field to understand both worlds.
From a value-driven development perspective, understanding the limitations of classical methods is crucial. This insight allows you to effectively utilize new ML tools to design a system that addresses current issues and delivers immediate impact.
Additionally, planning is a critical part of all autonomous agents, not just in autonomous driving. Understanding what planning is and how it works will enable more ML talents to work on this exciting topic and contribute to the development of truly autonomous agents, whether they are cars or other forms of automation.
What is planning?
The problem formulation
As the “brain” of autonomous vehicles, the planning system is crucial for the safe and efficient driving of vehicles. The goal of the planner is to generate trajectories that are safe, comfortable, and efficiently progressing towards the goal. In other words, safety, comfort, and efficiency are the three key objectives for planning.
As input to the planning systems, all perception outputs are required, including static road structures, dynamic road agents, free space generated by occupancy networks, and traffic wait conditions. The planning system must also ensure vehicle comfort by monitoring acceleration and jerk for smooth trajectories, while considering interaction and traffic courtesy.
The planning systems generate trajectories in the format of a sequence of waypoints for the ego vehicle’s low-level controller to track. Specifically, these waypoints represent the future positions of the ego vehicle at a series of fixed time stamps. For example, each point might be 0.4 seconds apart, covering an 8-second planning horizon, resulting in a total of 20 waypoints.
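The waypoint arithmetic above can be sketched in a few lines. This is a minimal illustration, not a production interface: the `Waypoint` class, the ego-centred frame, and the constant-velocity motion are all assumptions made for the example; only the 0.4-second spacing and 8-second horizon come from the text.

```python
from dataclasses import dataclass


@dataclass
class Waypoint:
    t: float  # time offset from now, in seconds
    x: float  # longitudinal position in an assumed ego-centred frame, metres
    y: float  # lateral position in the same frame, metres


def constant_velocity_trajectory(speed_mps, horizon_s=8.0, dt_s=0.4):
    """Build a straight-line trajectory sampled at fixed time stamps."""
    num_points = round(horizon_s / dt_s)  # 8 s / 0.4 s = 20 waypoints
    return [
        Waypoint(t=(i + 1) * dt_s, x=speed_mps * (i + 1) * dt_s, y=0.0)
        for i in range(num_points)
    ]


trajectory = constant_velocity_trajectory(speed_mps=10.0)
print(len(trajectory), trajectory[-1])
```

A real planner would fill in the positions from its own optimization; the point here is only that a trajectory is a list of time-stamped positions that the controller tracks.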
A classical planning stack roughly consists of global route planning, local behavior planning, and local trajectory planning. Global route planning provides a road-level path from the start point to the end point on a global map. Local behavior planning decides on a semantic driving action type (e.g., car following, nudging, side passing, yielding, and overtaking) for the next several seconds. Based on the decided behavior type from the behavior planning module, local trajectory planning generates a short-term trajectory. The global route planning is typically provided by a map service once navigation is set and is beyond the scope of this post. We will focus on behavior planning and trajectory planning from now on.
Behavior planning and trajectory generation can work explicitly in tandem or be combined into a single process. In explicit methods, behavior planning and trajectory generation are distinct processes operating within a hierarchical framework, working at different frequencies, with behavior planning at 1–5 Hz and trajectory planning at 10–20 Hz. Although this hierarchy is highly efficient most of the time, adapting it to different scenarios may require significant modifications and fine-tuning. More advanced planning systems combine the two into a single optimization problem. This approach aims to ensure feasibility and optimality together, without the compromises that a hierarchical split introduces.
The Glossary of Planning
You might have noticed that the terminology used in the above section and the image do not completely match. There is no standard terminology that everyone uses. Across both academia and industry, it is not uncommon for engineers to use different names to refer to the same concept and the same name to refer to different concepts. This indicates that planning in autonomous driving is still under active development and has not fully converged.
Here, I list the terminology used in this post and briefly explain other terms found in the literature.
- Planning: A top-level concept, parallel to control, that generates trajectory waypoints. Together, planning and control are jointly referred to as PnC (planning and control).
- Control: A top-level concept that takes in trajectory waypoints and generates high-frequency steering, throttle, and brake commands for actuators to execute. Control is relatively well-established compared to other areas and is beyond the scope of this post, despite the common notion of PnC.
- Prediction: A top-level concept that predicts the future trajectories of traffic agents other than the ego vehicle. Prediction can be considered a lightweight planner for other agents and is also called motion prediction.
- Behavior Planning: A module that produces high-level semantic actions (e.g., lane change, overtake) and typically generates a coarse trajectory. It is also known as task planning or decision making, particularly in the context of interactions.
- Motion Planning: A module that takes in semantic actions and produces smooth, feasible trajectory waypoints for the duration of the planning horizon for control to execute. It is also referred to as trajectory planning.
- Trajectory Planning: Another term for motion planning.
- Decision Making: Behavior planning with a focus on interactions. Without ego-agent interaction, it is simply referred to as behavior planning. It is also known as tactical decision making.
- Route Planning: Finds the preferred route over road networks, also known as mission planning.
- Model-Based Approach: In planning, this refers to manually crafted frameworks used in the classical planning stack, as opposed to neural network models. Model-based methods contrast with learning-based methods.
- Multimodality: In the context of planning, this typically refers to multiple intentions. This contrasts with multimodality in the context of multimodal sensor inputs to perception or multimodal large language models (such as VLM or VLA).
- Reference Line: A local (several hundred meters) and coarse path based on global routing information and the current state of the ego vehicle.
- Frenet Coordinates: A coordinate system based on a reference line. Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. See below for a more detailed introduction.
- Trajectory: A 3D spatiotemporal curve, in the form of (x, y, t) in Cartesian coordinates or (s, l, t) in Frenet coordinates. A trajectory is composed of both path and speed.
- Path: A 2D spatial curve, in the form of (x, y) in Cartesian coordinates or (s, l) in Frenet coordinates.
- Semantic Action: A high-level abstraction of action (e.g., car following, nudge, side pass, yield, overtake) with clear human intention. Also referred to as intention, policy, maneuver, or primitive motion.
- Action: A term with no fixed meaning. It can refer to the output of control (high-frequency steering, throttle, and brake commands for actuators to execute) or the output of planning (trajectory waypoints). Semantic action refers to the output of behavior planning.
Different literature may use various notations and concepts. Here are some examples:
These variations illustrate the diversity in terminology and the evolving nature of the field.
Behavior Planning
As a machine learning engineer, you may notice that the behavior planning module is a heavily manually crafted intermediate module. There is no consensus on the exact form and content of its output. Concretely, the output of behavior planning can be a reference path or object labeling on ego maneuvers (e.g., pass from the left or right-hand side, pass or yield). The term “semantic action” has no strict definition and no fixed methods.
The decoupling of behavior planning and motion planning increases efficiency in solving the extremely high-dimensional action space of autonomous vehicles. The actions of an autonomous vehicle need to be reasoned at typically 10 Hz or more (time resolution in waypoints), and most of these actions are relatively straightforward, like going straight. After decoupling, the behavior planning layer only needs to reason about future scenarios at a relatively coarse resolution, while the motion planning layer operates in the local solution space based on the decision made by behavior planning. Another benefit of behavior planning is converting non-convex optimization to convex optimization, which we will discuss further below.
Frenet vs Cartesian systems
The Frenet coordinate system is a widely adopted system that merits its own introduction section. The Frenet frame simplifies trajectory planning by independently managing lateral and longitudinal movements relative to a reference path. The s coordinate represents longitudinal displacement (distance along the road), while the l (or d) coordinate represents lateral displacement (side position relative to the reference path).
Frenet simplifies a curvy path in Cartesian coordinates to a straight tunnel model. This transformation converts non-linear road boundary constraints on curvy roads into linear ones, significantly simplifying the subsequent optimization problems. Additionally, humans perceive longitudinal and lateral movements differently, and the Frenet frame allows for separate and more flexible optimization of these movements.
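To make the straight-tunnel idea concrete, here is a minimal sketch of a Cartesian-to-Frenet conversion. It assumes the reference line is a polyline of (x, y) points and uses the sign convention that l is positive to the left of the travel direction; both are illustrative choices, and production systems typically use smooth reference lines with curvature information instead.

```python
import math


def cartesian_to_frenet(x, y, ref_line):
    """Project (x, y) onto a polyline reference line and return (s, l).

    s is the arc length along the reference line to the projection point.
    l is the signed lateral offset (assumed positive on the left).
    """
    best = None  # (distance, s, l) of the closest projection so far
    s_at_segment_start = 0.0
    for (x0, y0), (x1, y1) in zip(ref_line, ref_line[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        if seg_len == 0.0:
            continue  # skip duplicated points
        # Fraction along the segment of the closest point, clamped to [0, 1].
        u = ((x - x0) * dx + (y - y0) * dy) / (seg_len * seg_len)
        u = min(1.0, max(0.0, u))
        px, py = x0 + u * dx, y0 + u * dy
        dist = math.hypot(x - px, y - py)
        if best is None or dist < best[0]:
            # The 2D cross product tells us which side of the segment we are on.
            cross = dx * (y - y0) - dy * (x - x0)
            best = (dist, s_at_segment_start + u * seg_len, math.copysign(dist, cross))
        s_at_segment_start += seg_len
    if best is None:
        raise ValueError("reference line needs at least two distinct points")
    return best[1], best[2]


# A reference line that drives 10 m east, then turns left and drives 10 m north.
reference_line = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
print(cartesian_to_frenet(4.0, 1.5, reference_line))
print(cartesian_to_frenet(11.0, 5.0, reference_line))
```

In Frenet coordinates, a lane boundary that bends with the road becomes a simple bound on l, which is what makes the subsequent constraints linear.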
The Frenet coordinate system requires a clean, structured road graph with low curvature lanes. In practice, it is preferred for structured roads with small curvature, such as highways or city expressways. However, the issues with the Frenet coordinate system are amplified with increasing reference line curvature, so it should be used cautiously on structured roads with high curvature, like city intersections with guide lines.
For unstructured roads, such as ports, mining areas, parking lots, or intersections without guidelines, the more flexible Cartesian coordinate system is recommended. The Cartesian system is better suited for these environments because it can handle higher curvature and less structured scenarios more effectively.
Classical tools-the troika of planning
Planning in autonomous driving involves computing a trajectory from an initial high-dimensional state (including position, time, velocity, acceleration, and jerk) to a target subspace, ensuring all constraints are satisfied. Searching, sampling, and optimization are the three most widely used tools for planning.
Searching
Classical graph-search methods are popular in planning and are used in route/mission planning on structured roads or directly in motion planning to find the best path in unstructured environments (such as parking or urban intersections, especially mapless scenarios). There is a clear evolution path, from Dijkstra’s algorithm to A* (A-star), and further to hybrid A*.
Dijkstra’s algorithm explores all possible paths to find the shortest one, making it a blind (uninformed) search algorithm. It is a systematic method that guarantees the optimal path, but it is inefficient to deploy. As shown in the chart below, it explores almost all directions. Essentially, Dijkstra’s algorithm is a breadth-first search (BFS) weighted by movement costs. To improve efficiency, we can use information about the location of the target to trim down the search space.
The A* algorithm uses heuristics to prioritize paths that appear to be leading closer to the goal, making it more efficient. It combines the cost so far (Dijkstra) with the cost to go (heuristics, essentially greedy best-first). A* guarantees the shortest path only if the heuristic is admissible and consistent. If the heuristic is poor, A* can perform worse than the Dijkstra baseline and may degenerate into a greedy best-first search.
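The relationship between the two algorithms can be sketched on a toy grid. The function below is a minimal illustration, assuming a 4-connected grid with unit step cost and a Manhattan-distance heuristic (which is admissible and consistent in that setting); the function name, the grid, and the tie-breaking rule are choices made for this example. Setting the heuristic to zero turns the same code into Dijkstra's algorithm.

```python
import heapq


def grid_search(grid, start, goal, use_heuristic=True):
    """A* on a 4-connected grid; with use_heuristic=False it is Dijkstra.

    grid[r][c] == 1 marks an obstacle. Returns (path_cost, nodes_expanded),
    with path_cost None when the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        if not use_heuristic:
            return 0
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    g_cost = {start: 0}  # best known cost so far for each cell
    # Heap entries are (f = g + h, h, cell); ties on f prefer cells nearer the goal.
    open_heap = [(h(start), h(start), start)]
    closed = set()
    while open_heap:
        _, _, cell = heapq.heappop(open_heap)
        if cell in closed:
            continue  # stale heap entry
        closed.add(cell)
        g = g_cost[cell]
        if cell == goal:
            return g, len(closed)
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == 1:
                continue
            new_g = g + 1
            if new_g < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = new_g
                heapq.heappush(open_heap, (new_g + h(nxt), h(nxt), nxt))
    return None, len(closed)


toy_grid = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
astar_cost, astar_expanded = grid_search(toy_grid, (0, 0), (4, 5), use_heuristic=True)
dijkstra_cost, dijkstra_expanded = grid_search(toy_grid, (0, 0), (4, 5), use_heuristic=False)
print(astar_cost, astar_expanded, dijkstra_cost, dijkstra_expanded)
```

Both runs return the same path cost, but the heuristic lets A* reach the goal after expanding no more cells than Dijkstra, and usually far fewer on open maps.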
In the specific application of autonomous driving, the hybrid A* algorithm further improves A* by considering vehicle kinematics. A path produced by plain A* may violate kinematic constraints (e.g., the steering angle is typically within 40 degrees) and therefore cannot be tracked accurately. While A* operates in grid space for both state and action, hybrid A* separates them, maintaining the state in the grid but allowing continuous action according to kinematics.
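The separation of a continuous state from a discrete grid can be sketched as a node-expansion step. This is only an illustration of the idea, assuming a kinematic bicycle model, forward motion only, and three steering options; the wheelbase, steering limit, step length, and grid resolution below are made-up values, and a full hybrid A* would add costs, collision checks, and reverse motion.

```python
import math

WHEELBASE_M = 2.8                    # assumed vehicle wheelbase
MAX_STEER_RAD = math.radians(35.0)   # assumed limit, inside the ~40 degree bound
STEP_M = 1.0                         # arc length driven per expansion
CELL_M = 0.5                         # grid resolution used only for pruning
HEADING_BINS = 72                    # 5-degree heading buckets


def expand(state, steer_angles=(-MAX_STEER_RAD, 0.0, MAX_STEER_RAD)):
    """Return continuous child states (x, y, heading) reachable in one step."""
    x, y, theta = state
    children = []
    for steer in steer_angles:
        # Bicycle model: heading change after driving STEP_M at this steering angle.
        dtheta = STEP_M / WHEELBASE_M * math.tan(steer)
        if abs(dtheta) < 1e-9:
            nx = x + STEP_M * math.cos(theta)
            ny = y + STEP_M * math.sin(theta)
        else:
            radius = STEP_M / dtheta  # signed turning radius
            nx = x + radius * (math.sin(theta + dtheta) - math.sin(theta))
            ny = y - radius * (math.cos(theta + dtheta) - math.cos(theta))
        children.append((nx, ny, (theta + dtheta) % (2.0 * math.pi)))
    return children


def cell_key(state):
    """Discretize a continuous state; one node is kept per (x, y, heading) cell."""
    x, y, theta = state
    heading_bin = int(theta / (2.0 * math.pi) * HEADING_BINS) % HEADING_BINS
    return (math.floor(x / CELL_M), math.floor(y / CELL_M), heading_bin)


for child in expand((0.0, 0.0, 0.0)):
    print(child, cell_key(child))
```

The search still uses the grid to decide which states count as already visited, but every stored state is an exact pose that a vehicle could reach, so the resulting path is drivable by construction.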
Analytical expansion (shot to goal) is another key innovation proposed by hybrid A*. A natural enhancement to A* is to connect the most recently explored nodes to the goal using a non-colliding straight line. If this is possible, we have found the solution. In hybrid A*, this straight line is replaced by Dubins and Reeds-Shepp (RS) curves, which comply with vehicle kinematics. This early stopping method strikes a balance between optimality and feasibility by focusing more on feasibility for the far end of the path.
Hybrid A* is used heavily in parking scenarios and mapless urban intersections. Here is a very nice video showcasing how it works in a parking scenario.
Sampling
Another popular method of planning is sampling. The well-known Monte Carlo method is a random sampling method. In essence, sampling involves selecting many candidates randomly or according to a prior, and then selecting the best one according to a defined cost. For sampling-based methods, the fast evaluation of many options is critical, as it directly impacts the real-time performance of the autonomous driving system.
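The sample-then-select pattern can be sketched as follows. The scenario is hypothetical: the ego vehicle chooses a lateral offset l to pass an obstacle, offsets below 0.8 m are treated as collisions, offsets above 1.5 m leave the lane, and the remaining cost penalizes deviation from the lane centre. The numbers, function names, and the Gaussian prior are assumptions for illustration only.

```python
import math
import random

MIN_CLEAR_OFFSET_M = 0.8  # assumed smallest offset that clears the obstacle
MAX_OFFSET_M = 1.5        # assumed largest offset that stays inside the lane


def sample_and_select(sampler, cost_fn, num_samples=200, seed=0):
    """Draw candidates from a prior and return the one with the lowest cost."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    candidates = [sampler(rng) for _ in range(num_samples)]
    return min(candidates, key=cost_fn)


def lateral_cost(offset_m):
    """Infinite cost for infeasible offsets, squared deviation otherwise."""
    if offset_m < MIN_CLEAR_OFFSET_M or offset_m > MAX_OFFSET_M:
        return math.inf
    return offset_m * offset_m


def centre_biased_sampler(rng):
    """Prior: offsets near the lane centre are more likely."""
    return rng.gauss(0.0, 1.0)


best_offset = sample_and_select(centre_biased_sampler, lateral_cost)
print(best_offset, lateral_cost(best_offset))
```

Because every candidate is scored independently, the evaluation step parallelizes well, which is why fast cost evaluation dominates the runtime of sampling-based planners.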
Large Language Models (LLMs) essentially provide samples, and there needs to be an evaluator with a defined cost that aligns with human preferences. This evaluation process ensures that the selected output meets the desired criteria and quality standards.
Sampling can occur in a parameterized solution space if we already know the analytical solution to a given problem or subproblem. For example, typically we want to minimize the time integral of the square of jerk (the third derivative of position p