End-to-End Motion Planning With Deep Learning

Mankaran Singh
12 min read · Sep 14, 2020

--

For the past few months I have been very interested in self-driving car technology. There are endless possibilities for how you can apply your machine learning knowledge to develop parts of the self-driving car stack.

In this post, I’ll take you through the process of developing an end-to-end motion planner for autonomous robots. This project is hugely inspired by the approach that comma.ai uses in openpilot.

Different Approaches to Self Driving

There are several approaches to building self-driving car software. A model can either be trained end-to-end, or separate models can be trained using a mid-to-mid approach. Below are some examples:

  • End-to-End Behavioral Cloning: In this approach, a CNN is trained end-to-end. The input is a sequence of images or a single image, and the output is directly a steering angle. This approach became popular when Nvidia trained this type of neural network at scale; the respective blog post can be found here. I have also worked on implementing many similar kinds of models. The link to the repo can be found here; it contains a step-by-step procedure to train these kinds of models, some of them being Nvidia’s model, Comma AI’s model, 3D CNNs, and LSTMs. This approach is quite easy to implement since the model learns almost everything internally and we don’t have to implement the conventional control stack, localizer, planner, etc. But the downside? You lose fine-grained control over the model’s output, the results cannot be interpreted by humans (total black boxes), and it is difficult to incorporate prior knowledge.
Image In, Controls Out
  • Mid to Mid: In this approach, instead of directly outputting controls, the network is broken down into as many parts as you would like. For example, separate networks are trained for object detection, motion prediction, path planning, etc. This approach is used by Lyft, Tesla, and others. Shown below is how Lyft detects other vehicles, places them on a semantic map of the surroundings, and predicts the motion of all other agents; these results are then passed to controllers which finally produce steering, throttle, and other outputs.
From Lyft’s motion prediction challenge 2020

The disadvantages of this approach are that errors keep accumulating across the multiple steps, and putting all the pieces together is exponentially harder. On the flip side, the advantages are that the process is fully interpretable by humans, so we can easily reason about the decisions of the self-driving car, and incorporating prior knowledge becomes possible.

  • Comma’s Approach: Comma skips object detection, depth estimation, etc. and directly outputs the trajectory to be followed by the vehicle. Read more about this in their blog post here. This path is then tracked by a conventional control stack maintained by their team.

Understanding the Super Combo Model

The supercombo model is elegantly crafted yet lightweight. It takes the following inputs:

  • 2 Consecutive image frames in YUV format
  • State of the Recurrent Cells (being a recurrent neural network)
  • Desire (Discussed later)

and predicts the following outputs:

  • Path (Trajectory output to be followed)
  • Left Lane
  • Right Lane
  • Standard Deviations of Path and Lanes (Discussed Below)
  • Lead Car information
  • Longitudinal Accelerations, Velocities and Displacements for n future steps
  • Pose (Discussed Below)
  • And some other stuff I am not sure about

The Path, Left Lane and Right Lane outputs are the predicted trajectory and lane lines in a top-down / bird’s-eye-view coordinate system relative to the ego vehicle.

Path and Lane outputs of supercombo

See how the model predicts the curve on the road to the right.

The other thing to notice here is that the lane lines also fade when no lane lines are visible, or simply when the model is less confident about its prediction. This ‘confidence’ score is predicted by the model too, in the form of the ‘standard deviations’. These standard deviations are predicted for the path, the left lane, and the right lane, and for many other outputs including the pose and lead outputs.

The standard deviations can be predicted using a Bayesian neural network that carries the uncertainty of the predictions down through the network as they are computed.

The pose output is the predicted translation and rotation between the two input images. This is probably used to perform visual odometry and to localize the car with respect to the output trajectories.

Data Creation

Before even thinking about training the neural network, we need to collect and process the data into the right format for training. Comma.ai has released an open-source dataset containing much of this information, like lanes, path and other outputs. But I wanted a bit more of a challenge and instead decided to use Udacity’s dataset, which just contained:

  • Steering angles
  • Speed
  • Gyroscope Readings
  • GPS Readings
  • Image frames from center-, left- and right-mounted cameras.

For training a custom model, I decided to just predict future path and speeds.

Desire

Desire is basically a one-hot encoding of high-level actions like ‘change lane left’, ‘change lane right’, ‘stay in lane’, ‘turn right’, ‘turn left’, etc.
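For concreteness, here is a minimal sketch of what such an encoding can look like; the label set and ordering are my own illustration, not necessarily what openpilot uses.

import numpy as np

# Hypothetical label set; the actual desire vector may differ.
DESIRES = ["stay in lane", "change lane left", "change lane right",
           "turn left", "turn right"]

def encode_desire(action):
    # One-hot vector with a 1 at the index of the chosen high-level action
    vec = np.zeros(len(DESIRES), dtype=np.float32)
    vec[DESIRES.index(action)] = 1.0
    return vec

encode_desire("turn right")  # -> array([0., 0., 0., 0., 1.], dtype=float32)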

The Udacity dataset didn’t contain any “desire”, so I needed to manually label 10,000 frames :(

Labeling 10,000 frames by hand is not that easy. So instead of labeling each frame individually, I created a small tool for this purpose: the Autolabler.

Using this, my friend Raghav labelled the whole dataset in under 30 minutes. It’s more like a small video player: you can pause, seek, fast-forward, and slow down the input frame stream as you like, and it keeps tagging the selected label on the fly as the stream is being played. The full description of how to use it can be found in my repo.

Path

For trajectory ground-truthing, I was constrained to use only the information that was present in the Udacity dataset.

First of all, I used a simple bicycle model, stepping it forward through time with the recorded velocity and steering angle as inputs.

At low speeds, a simple kinematic bicycle model works well, taking the center of mass as the reference frame (a minimal sketch follows the figures below).

Bicycle model with reference as center of gravity of car
Model Parameters
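Here is a minimal sketch of one step of that kinematic bicycle model, with the center of mass as the reference point; the axle distances and time step are placeholder values, not the actual parameters of the data-collection car.

import math

# Distances from the center of mass to the front/rear axles (m), placeholders
L_F, L_R = 1.2, 1.6

def bicycle_step(x, y, yaw, v, steer, dt=0.05):
    # Slip angle at the center of mass
    beta = math.atan(L_R / (L_F + L_R) * math.tan(steer))
    # Integrate position and heading one time step forward
    x += v * math.cos(yaw + beta) * dt
    y += v * math.sin(yaw + beta) * dt
    yaw += (v / L_R) * math.sin(beta) * dt
    return x, y, yaw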

I could have used a dynamic model for the lateral dynamics, but I didn’t really have the vehicle parameters of the car that was used to collect the dataset.

Now, the above model is definitely very simple and cannot model the actual physics of the vehicle very well, so it made sense to use additional data from the gyro and GPS and fuse it all together with Kalman filters. Also, one rule of developing an algorithm is: never throw any data away, even if it is very noisy. Find ways to incorporate information from other data sources; this will almost always improve the algorithm.

Fortunately, the GPS model was known, so I was able to construct a Kalman filter.

Different tracking methods in action

See how the process model (red) starts drifting away from the original trajectory after some time.

I combined the following sources using a Kalman filter (a minimal fusion sketch follows the list):

  • Bicycle model as the process model
  • GPS readings
  • IMU-based odometry
  • Visual odometry
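As a rough illustration of the fusion step, here is a minimal constant-velocity Kalman filter that only ingests GPS-derived XY positions; the filter I actually used also folds in the bicycle model and the odometry sources, and all matrices and noise values below are placeholder assumptions.

import numpy as np

dt = 0.05                                    # time step (s), placeholder

F = np.array([[1, 0, dt, 0],                 # state transition for [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]])
H = np.array([[1, 0, 0, 0],                  # we only measure position (GPS)
              [0, 1, 0, 0]])
Q = np.eye(4) * 0.01                         # process noise, hand-tuned placeholder
R = np.eye(2) * 2.0                          # GPS measurement noise, placeholder

def kf_step(x, P, z):
    # Predict with the process model
    x = F @ x
    P = F @ P @ F.T + Q
    # Update with the GPS measurement z (2x1 column vector)
    y = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ y
    P = (np.eye(4) - K @ H) @ P
    return x, P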

One difficulty was making sure all the readings were converted to the same coordinate system with the same reference (the car). For example, the GPS latitude and longitude readings had to be converted to an XY grid in meters.

For small distances, I used the formula below as an approximation to convert the difference between two latitude/longitude pairs into planar X/Y distances (the constant 40075 is the Earth’s circumference in kilometres, so the result comes out in kilometres; scale by 1000 for metres). Latitude and longitude are basically angular measurements on the globe with respect to the equator and the prime meridian respectively.

import math

def latlon_to_xy_grid(lat1, lon1, lat2, lon2):
    # 40075 km is the Earth's circumference, so dx/dy come out in kilometres
    dx = (lon1 - lon2) * 40075 * math.cos((lat1 + lat2) * math.pi / 360) / 360
    dy = (lat1 - lat2) * 40075 / 360
    return dx, dy

Now, to collect extra data covering cases like avoiding obstacles, tight curves, and the problem of the car drifting away from the lane, I made use of the CARLA simulator.

Tight curves

Since I had the parameters of the camera used to collect the Udacity dataset, I was able to set up a similar camera model in CARLA. It was relatively easy to obtain ground-truth data from CARLA.

Also, for avoiding obstacles, I wrote a script that would create a scene, place a vehicle on the edge of the lane, and use the CARLA autopilot to generate a trajectory that avoids the obstacle. Within a few minutes of writing the script, I had thousands of scenes of this type. This could actually make the model worse in real life, because the generated trajectories may not be close to what humans would actually do.

Ugh, looks about right ?

There were certain problems with how CARLA directly saved the data to disk. This caused a huge slowdown of the simulator, which made it impossible to manually control the car and also slowed down the data collection process.

Because of this, I had to write manual buffers that keep accumulating data up to a certain size and then dump it to disk all at once. This hugely improved the efficiency of the data collection process.
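A minimal sketch of that idea looks like the following; the buffer size, file naming, and numpy-based dump are my own assumptions and not part of CARLA’s API.

import numpy as np

class BufferedWriter:
    """Accumulate frames in memory and dump them to disk in one big write."""

    def __init__(self, path, buffer_size=500):
        self.path, self.buffer_size = path, buffer_size
        self.buffer, self.chunk_id = [], 0

    def add(self, frame):
        self.buffer.append(frame)
        if len(self.buffer) >= self.buffer_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One large write instead of thousands of tiny ones
        np.save(f"{self.path}_{self.chunk_id:05d}.npy", np.stack(self.buffer))
        self.buffer, self.chunk_id = [], self.chunk_id + 1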

A bunch of trajectories sampled from dataset

Now the data collection process was finally complete. Let’s move on to model training!

Analyzing Super Combo Model

Part of feature extractor

For feature extraction from the frames, supercombo uses a lot of skip connections: it takes a 256 × 128 × 12 input (two consecutive frames in YUV format with 3 extra alpha channels; see here) and boils it down to a 1024-dimensional encoding.

This encoding learns all the information from the image once you train the path planner.

From comma.ai’s blog post:

“We can get insight into what is encoded in the 1024 output vector of the vision model by running it on an image and trying to reconstruct an image from it with a GAN. Above is the original image of a road, below is a reconstruction made by a GAN trained on a few million images and feature vectors. Notice how details relevant to planning such as locations of the lanes and lead car are preserved, while irrelevant details such as the colors of cars and background scenery are lost.” — Harald Schäfer

The model learns to encode all the relevant information required for planning into a compressed form. This vision encoding is later forked into several branches which are processed independently to output lanes, paths, etc.

Forking into branches

Supercombo also uses a GRU-like layer to encode temporal information. The network is stateful: the state output of the recurrent layer is fed back in during the next inference.
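To make the structure concrete, here is a minimal Keras sketch of the pattern (shared vision encoding → recurrent layer → independent output branches). The layer types, sizes, and input shapes are my own placeholders, not the actual supercombo architecture.

import tensorflow as tf
from tensorflow.keras import layers

frames = tf.keras.Input(shape=(128, 256, 12))   # two stacked YUV frames
desire = tf.keras.Input(shape=(8,))             # one-hot desire (length assumed)
rnn_state = tf.keras.Input(shape=(512,))        # recurrent state from the last step

# Convolutional feature extractor boiled down to a 1024-d vision encoding
x = layers.Conv2D(32, 3, strides=2, activation="elu")(frames)
x = layers.Conv2D(64, 3, strides=2, activation="elu")(x)
x = layers.GlobalAveragePooling2D()(x)
encoding = layers.Dense(1024, activation="elu")(x)

# Temporal layer: a GRU whose state is fed back in at the next inference
h = layers.Concatenate()([encoding, desire])
h = layers.Reshape((1, -1))(h)                  # sequence of length 1
temporal, new_state = layers.GRU(512, return_state=True)(h, initial_state=rnn_state)

# Fork the shared features into independent output branches
path = layers.Dense(50 * 2, name="path")(temporal)
speed = layers.Dense(50, name="speed")(temporal)

model = tf.keras.Model([frames, desire, rnn_state], [path, speed, new_state])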

Training The Model

For training my model, I take desire and frames as inputs (the RNN state is fed back only at inference time), and the outputs are only the path and velocities for the next 50 meters/steps, along with their standard deviations.

Bayesian Neural Networks

The standard deviations give us an idea of how confident the model is about its prediction. This can be done using a special kind of network called a Bayesian neural network.

Instead of directly regressing the lanes, paths, and velocities, we predict a distribution over them.

Source — https://www.youtube.com/watch?v=z7xV-HYVAZ8

See this video by AI student for a beautiful explanation of this.

We can also output multiple paths along with their confidence scores, which turns this into a mixture density network (MDN).

Outputting multiple Gaussian distributions and their scores (alphas).

I trained the network with a batch size of 16, 16 time steps from the past, and 2 Gaussian mixture components.
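For reference, here is a minimal sketch of what such an MDN output head can look like; the layer sizes and the softplus choice for keeping the sigmas positive are my own assumptions (the repo linked later provides its own MDN layer).

from tensorflow.keras import layers

NUM_MIXES = 2    # Gaussian mixture components, as used in training
PATH_LEN = 50    # future path points / speed steps being predicted

def mdn_head(features):
    # Mixture weights (logits), means, and standard deviations per component
    out_pi = layers.Dense(NUM_MIXES, name="pi")(features)
    out_mu = layers.Dense(NUM_MIXES * PATH_LEN, name="mu")(features)
    # Softplus keeps the predicted standard deviations strictly positive
    out_sigma = layers.Dense(NUM_MIXES * PATH_LEN, activation="softplus",
                             name="sigma")(features)
    return out_pi, out_mu, out_sigma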

Data Augmentation

I used data augmentation to make the model robust to noise. This included adding Poisson noise to the images, shuffling their color channels, flipping them horizontally, and changing the contrast, color temperature, etc. (a small sketch follows the figure below).

Data Augmentation
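A minimal sketch of such an augmentation step is shown below; the probabilities and ranges are placeholders, and the horizontal flip assumes the path label stores a lateral coordinate in its second column that must be mirrored along with the image.

import numpy as np

def augment(img, path, rng=np.random.default_rng()):
    img = img.astype(np.float32)
    if rng.random() < 0.5:                        # Poisson (shot) noise
        img = rng.poisson(np.clip(img, 0, None)).astype(np.float32)
    if rng.random() < 0.5:                        # shuffle color channels
        img = img[..., rng.permutation(img.shape[-1])]
    if rng.random() < 0.5:                        # contrast jitter
        img = np.clip((img - 128.0) * rng.uniform(0.8, 1.2) + 128.0, 0.0, 255.0)
    if rng.random() < 0.5:                        # horizontal flip
        img = img[:, ::-1]
        path = path * np.array([1.0, -1.0])       # mirror the lateral coordinate too
    return img, path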

Loss function

Simple point output (no uncertainty prediction): In practical applications, we minimize the squared error term ( μ(x, Θ) − y )² of the output of the function μ given x, its parameters Θ, and the target value y, for all pairs (x, y) in some dataset 𝔻. The learned function essentially “spits out” the conditional mean μ(x, Θ) of a Gaussian distribution given the data and parameters; the standard deviation and normalization constant are thrown away, since they do not depend on Θ.

I tried multiple loss functions. In the case of the simple regression output discussed above (a plain neural network with no standard deviation outputs), I used a weighted sum of the L2 loss between the trajectories and the L1 loss between the gradients of the trajectories.

The intuition behind the gradient loss is that we humans estimate the ‘intensity’ with which to turn the steering wheel from the curvature of the road ahead, so it made sense to incorporate this into the loss function. It turned out to perform better than the simple L2 loss between trajectories alone.
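Here is a minimal sketch of that combined loss; the weighting factor is a placeholder, and y_true / y_pred are assumed to be batches of trajectories of shape (batch, points, 2).

import tensorflow as tf

def trajectory_loss(y_true, y_pred, grad_weight=1.0):
    # L2 loss on the trajectory points themselves
    l2 = tf.reduce_mean(tf.square(y_pred - y_true))
    # L1 loss on point-to-point differences (the trajectory "gradients"),
    # which penalises getting the local curvature / heading changes wrong
    d_true = y_true[:, 1:] - y_true[:, :-1]
    d_pred = y_pred[:, 1:] - y_pred[:, :-1]
    grad_l1 = tf.reduce_mean(tf.abs(d_pred - d_true))
    return l2 + grad_weight * grad_l1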

MDN network: The standard deviation in this case is conditioned on the input, allowing us to account for a variable standard deviation. This advantage applies even when we use just a single Gaussian distribution.

For the MDN (which predicts standard deviations too), the loss is computed as:

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

# out_pi, out_mu, out_sigma are the model's mixture outputs; output_dim is the
# trajectory length and num_mixes the number of mixture components.
cat = tfd.Categorical(logits=out_pi)
component_splits = [output_dim] * num_mixes
mus = tf.split(out_mu, num_or_size_splits=component_splits, axis=1)
sigs = tf.split(out_sigma, num_or_size_splits=component_splits, axis=1)

# One diagonal Gaussian per mixture component
coll = [tfd.MultivariateNormalDiag(loc=loc, scale_diag=scale)
        for loc, scale in zip(mus, sigs)]
mixture = tfd.Mixture(cat=cat, components=coll)
loss = tf.negative(mixture.log_prob(y_true))   # negative log-likelihood
loss = tf.reduce_mean(loss)

I used the implementation from this repo.

Testing

See how the network outputs the means (actual path) and deviations.

Initially the model is confident about its prediction and the deviation is low, but as it predicts the trajectory further into the future, the deviation increases, which makes intuitive sense.

Even the openpilot model shows similar behavior.

Path and lanes from original openpilot model

See how the model is fairly confident at smaller distances, but the predictions get noisier and more unstable at larger distances.

I’ll be doing more rigorous testing and will post a video of the model in action soon, on real-world as well as synthetic data. Now that we have the trajectory output, in the next post we may discuss how to actually convert the trajectory into controls and drive a simulated car. That is going to be a lot more work, but let’s try nonetheless :D

Conclusion

Congratulations on making it to the end! This was a long post.

So, this was my journey of training an end-to-end path planner, hugely inspired by the work of comma.ai. I’d like to thank the comma.ai Discord community for answering a lot of my doubts. I do not guarantee that everything in this post is correct; please feel free to point out any errors you find and help correct them by commenting below.

I explored a lot of things during this project. I didn’t go into much detail in this article, and I also missed writing about many of the challenges I faced along the way. Next time, I will try to write the article side by side so I don’t forget to add anything.

Please feel free to ask any questions down below in comments :)

Code

The whole project is some thousands of lines of ‘not very well structured code’. I’ll upload it after cleaning it up a bit. Keep an eye on my GitHub and LinkedIn for further updates :)
