In this project, we will work towards constructing an optimized Q-Learning driving agent that will navigate a Smartcab through its environment towards a goal. Since the Smartcab is expected to drive passengers from one location to another, the driving agent will be evaluated on two very important metrics: Safety and Reliability. A driving agent that gets the Smartcab to its destination while running red lights or narrowly avoiding accidents would be considered unsafe. Similarly, a driving agent that frequently fails to reach the destination in time would be considered unreliable. Maximizing the driving agent's safety and reliability would ensure that Smartcabs have a permanent place in the transportation industry.
Safety and Reliability are measured using a letter-grade system as follows:
Grade | Safety | Reliability |
---|---|---|
A+ | Agent commits no traffic violations, and always chooses the correct action. | Agent reaches the destination in time for 100% of trips. |
A | Agent commits few minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 90% of trips. |
B | Agent commits frequent minor traffic violations, such as failing to move on a green light. | Agent reaches the destination on time for at least 80% of trips. |
C | Agent commits at least one major traffic violation, such as driving through a red light. | Agent reaches the destination on time for at least 70% of trips. |
D | Agent causes at least one minor accident, such as turning left on green with oncoming traffic. | Agent reaches the destination on time for at least 60% of trips. |
F | Agent causes at least one major accident, such as driving through a red light with cross-traffic. | Agent fails to reach the destination on time for at least 60% of trips. |
To assist in evaluating these important metrics, we will use the following visualization code.
# Import the visualization code
! wget https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/smartcab/visuals.py
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
Before starting to work on implementing your driving agent, it's necessary to first understand the world (environment) in which the Smartcab and driving agent operate. One of the major components of building a self-learning agent is understanding the characteristics of the agent, which includes how the agent operates.
The environment is made up of roads, intersections, the Smartcab, and other agents (cars) that appear to operate within traffic rules. Red and green traffic lights indicate when a car should stop or go.
Observations you might make upon running the simulation for the first time, without any learning or smartcab code:
In addition to understanding the world, it is also necessary to understand the code itself that governs how the world, simulation, and so on operate. Attempting to create a driving agent would be difficult without having at least explored the "hidden" devices that make everything work.
Let's see what the key mechanics of our program are:
The first step to creating an optimized Q-Learning driving agent is getting the agent to actually take valid actions. In this case, a valid action is one of None (do nothing), 'Left' (turn left), 'Right' (turn right), or 'Forward' (go forward). For our first implementation, we will navigate to the 'choose_action()' agent function and make the driving agent randomly choose one of these actions. We have access to several class variables that will help us write this functionality, such as 'self.learning' and 'self.valid_actions'.
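A minimal sketch of that random behavior is shown below. This is illustrative only; the class and attribute names (and the lowercase action strings) are assumed rather than taken from the project code:
# Illustrative stub of the driving agent choosing a random action
import random

class LearningAgent(object):
    def __init__(self, learning=False):
        self.learning = learning
        # The four valid actions described above (lowercase strings assumed here)
        self.valid_actions = [None, 'forward', 'left', 'right']

    def choose_action(self):
        # First implementation: ignore the state entirely and act at random
        return random.choice(self.valid_actions)

# Each call returns None, 'forward', 'left', or 'right' with equal probability
agent = LearningAgent()
print(agent.choose_action())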
To obtain results from the initial simulation, we will need to adjust the following flags:

- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file in /logs/.
- 'n_test' - Set this to 10 to perform 10 testing trials.

# Load the 'sim_no-learning' log file from the initial simulation results
! wget https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/smartcab/logs/sim_no-learning.csv
vs.plot_trials('/content/sim_no-learning.csv')
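For reference, the flags above are typically set when the environment, agent, and simulator are wired together in agent.py's run() function. The sketch below is illustrative only; the module paths and the Environment/Simulator constructor arguments are assumed from the standard project layout and may differ slightly in your copy:
# Hedged sketch of agent.py's run() wiring (signatures assumed, not verified)
from environment import Environment   # project module (assumed path)
from simulator import Simulator       # project module (assumed path)

def run():
    env = Environment()                                  # the world: roads, lights, other cars
    agent = env.create_agent(LearningAgent)              # our driving agent (defined in agent.py)
    env.set_primary_agent(agent, enforce_deadline=True)  # 'enforce_deadline'
    sim = Simulator(env,
                    update_delay=0.01,                   # 'update_delay': shorten the pause between steps
                    log_metrics=True)                    # 'log_metrics': write results to /logs/ as .csv
    sim.run(n_test=10)                                   # 'n_test': 10 testing trials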
The second step to creating an optimized Q-learning driving agent is defining a set of states that the agent can occupy in the environment. Depending on the input, sensory data, and additional variables available to the driving agent, a set of states can be defined for the agent so that it can eventually learn what action it should take when occupying a state. The condition of 'if state then action'
for each state is called a policy, and is ultimately what the driving agent is expected to learn. Without defining states, the driving agent would never understand which action is most optimal -- or even what environmental variables and conditions it cares about!
Inspecting the 'build_state()' agent function shows that the driving agent is given the following data from the environment:

- 'waypoint', which is the direction the Smartcab should drive leading to the destination, relative to the Smartcab's heading.
- 'inputs', which is the sensor data from the Smartcab. It includes:
  - 'light', the color of the light.
  - 'left', the intended direction of travel for a vehicle to the Smartcab's left. Returns None if no vehicle is present.
  - 'right', the intended direction of travel for a vehicle to the Smartcab's right. Returns None if no vehicle is present.
  - 'oncoming', the intended direction of travel for a vehicle across the intersection from the Smartcab. Returns None if no vehicle is present.
- 'deadline', which is the number of actions remaining for the Smartcab to reach the destination before running out of time.

Some questions we can ask ourselves at this stage are:
Which features available to the agent are most relevant for learning both safety and efficiency? Why are these features appropriate for modeling the Smartcab in the environment? If you did not choose some features, why are those features not appropriate?
The 'waypoint', 'light', 'oncoming', and 'left' features are most relevant for learning both safety and efficiency.
The 'waypoint' helps lead the Smartcab to the destination along the most efficient route, improving efficiency.
The 'oncoming' and 'left' features contain the intended directions of travel of other cars near the Smartcab. We don't need to care about cars on the 'right', as they won't move forward on a red light, assuming they follow traffic rules, though we do need to watch for cars on the left before taking a right turn on a red light.
The 'light' feature makes our Smartcab aware of the traffic lights, so that it follows the rules and doesn't commit traffic violations.
This set of input features is important for ensuring the safety of the Smartcab and its passengers: not being aware of other cars can lead to accidents, while being unaware of the traffic lights can lead to traffic violations and/or accidents.
The 'deadline' feature is not important for the safety of the Smartcab, though it may have some bearing on efficiency. Nevertheless, it isn't chosen as 'important' because the 'waypoint' feature should already ensure that the chosen path is as close to optimal as possible.
When defining a set of states that the agent can occupy, it is necessary to consider the size of the state space. That is to say, if you expect the driving agent to learn a policy for each state, you would need to have an optimal action for every state the agent can occupy. If the number of all possible states is very large, it might be the case that the driving agent never learns what to do in some states, which can lead to uninformed decisions. For example, consider a case where the following features are used to define the state of the Smartcab:
('is_raining', 'is_foggy', 'is_red_light', 'turn_left', 'no_traffic', 'previous_turn_left', 'time_of_day').
How frequently would the agent occupy a state like (False, True, True, True, False, False, '3AM')
? Without a near-infinite amount of time for training, it's doubtful the agent would ever learn the proper action!
Some questions we can ask ourselves at this stage are:
If a state is defined using the features you've selected from the previous section, what would be the size of the state space? Given what you know about the environment and how it is simulated, do you think the driving agent could learn a policy for each possible state within a reasonable number of training trials?
Our chosen state space is (waypoint, light, oncoming, left). Its size is also smaller than it would be if we included 'deadline' in the state.
There are 3 possible values for 'waypoint' (forward, left, right), 2 possible values for 'light' (red, green), and 4 possible values for each of the cars on the 'left' and 'oncoming' (None, left, right, forward).
The number of combinations is therefore 3 x 2 x 4 x 4 = 96 states, which is a reasonable number of state-action mappings for the agent to learn within a reasonable number of training trials. In a few hundred trials, our agent should be able to visit each state at least once.
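A quick sanity check of that count, enumerating the value sets assumed from the environment description above:
# Enumerate the assumed value sets for each chosen feature and count the combinations
from itertools import product

waypoints = ['forward', 'left', 'right']        # 3 possible waypoints
lights    = ['red', 'green']                    # 2 light colors
traffic   = [None, 'forward', 'left', 'right']  # 4 intents for each of 'oncoming' and 'left'

states = list(product(waypoints, lights, traffic, traffic))
print(len(states))  # 3 * 2 * 4 * 4 = 96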
For our second implementation, we will use the 'build_state()' agent function to set the 'state' variable to a tuple of all the features necessary for Q-Learning.
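A minimal sketch of what that could look like, assuming the planner and sensing helpers that the project skeleton exposes to the agent:
# Hedged sketch of build_state(); helper names are assumed from the project skeleton
def build_state(self):
    waypoint = self.planner.next_waypoint()  # direction toward the destination
    inputs = self.env.sense(self)            # {'light': ..., 'oncoming': ..., 'left': ..., 'right': ...}

    # Only the features argued to be relevant; 'deadline' and 'right' are deliberately omitted
    state = (waypoint, inputs['light'], inputs['oncoming'], inputs['left'])
    return state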
The third step to creating an optimized Q-Learning agent is to begin implementing the functionality of Q-Learning itself. The concept of Q-Learning is fairly straightforward: for every state the agent visits, create an entry in the Q-table for all state-action pairs available. Then, when the agent encounters a state and performs an action, update the Q-value associated with that state-action pair based on the reward received and the iterative update rule implemented. Of course, additional benefits come from Q-Learning, such as the ability to have the agent choose the best action for each state based on the Q-values of every possible state-action pair.
For this project, we will be implementing a decaying, $\epsilon$-greedy Q-learning algorithm with no discount factor.
Note that the agent attribute self.Q is a dictionary: this is how the Q-table will be formed. Each state will be a key of the self.Q dictionary, and each value will then be another dictionary that holds the action and Q-value. Here is an example:
{ 'state-1': {
      'action-1' : Qvalue-1,
      'action-2' : Qvalue-2,
      ...
  },
  'state-2': {
      'action-1' : Qvalue-1,
      ...
  },
  ...
}
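As a sketch, an entry like this could be created the first time a state is visited; the helper name and attributes below are assumed rather than taken from the project code:
# Hedged sketch: initialise a Q-table entry for an unseen state
def createQ(self, state):
    if self.learning and state not in self.Q:
        # Every available action starts with a Q-value of 0.0
        self.Q[state] = {action: 0.0 for action in self.valid_actions}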
Furthermore, note that we are expected to use a decaying $\epsilon$ (exploration) factor. Hence, as the number of trials increases, $\epsilon$ should decrease towards 0. This is because the agent is expected to learn from its behavior and begin acting on its learned behavior.
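Putting these pieces together, here is a minimal sketch of the $\epsilon$-greedy action choice, the discount-free update, and the linear $\epsilon$ decay used by the default agent (the 0.05 step matches the decay function given below; method and attribute names are assumed):
# Hedged sketch of the default epsilon-greedy Q-Learning agent methods
import random

def choose_action(self, state):
    # Explore with probability epsilon, otherwise exploit the best-known action
    if (not self.learning) or (random.random() < self.epsilon):
        return random.choice(self.valid_actions)
    best_q = max(self.Q[state].values())
    best_actions = [a for a, q in self.Q[state].items() if q == best_q]
    return random.choice(best_actions)  # break ties randomly

def learn(self, state, action, reward):
    # Discount-free Q-Learning update: move Q toward the observed reward
    if self.learning:
        old_q = self.Q[state][action]
        self.Q[state][action] = old_q + self.alpha * (reward - old_q)

def reset(self):
    # Called at the start of each trial: decay epsilon linearly by 0.05
    if self.learning:
        self.epsilon = self.epsilon - 0.05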
To obtain results from the initial Q-Learning implementation, we will need to adjust the following flags and setup:

- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
- 'n_test' - Set this to 10 to perform 10 testing trials.
- 'learning' - Set this to True to tell the driving agent to use your Q-Learning implementation.

In addition, we can use the following decay function for $\epsilon$:

$$ \epsilon_{t+1} = \epsilon_{t} - 0.05, \hspace{10px}\textrm{for trial number } t$$

# Load the 'sim_default-learning' file from the default Q-Learning simulation
! wget https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/smartcab/logs/sim_default-learning.csv
vs.plot_trials('/content/sim_default-learning.csv')
Looking at the visualizations, there isn't much similarity between the basic driving agent and the default Q-Learning agent: our agent has clearly already started to learn.
The fourth step to creating an optimized Q-Learning agent is to perform the optimization! Now that the Q-Learning algorithm is implemented and the driving agent is successfully learning, it's necessary to tune settings and adjust learning parameters so the driving agent learns both safety and efficiency. Typically this step requires a lot of trial and error, as some settings will invariably make the learning worse. One thing to keep in mind is the act of learning itself and the time that this takes: in theory, we could allow the agent to learn for an incredibly long amount of time; however, another goal of Q-Learning is to transition from experimenting with unlearned behavior to acting on learned behavior. For example, always allowing the agent to perform a random action during training (if $\epsilon = 1$ and never decays) will certainly make it learn, but never let it act.
To obtain results from the improved Q-Learning implementation, we will need to adjust the following flags and setup:

- 'enforce_deadline' - Set this to True to force the driving agent to capture whether it reaches the destination in time.
- 'update_delay' - Set this to a small value (such as 0.01) to reduce the time between steps in each trial.
- 'log_metrics' - Set this to True to log the simulation results as a .csv file and the Q-table as a .txt file in /logs/.
- 'learning' - Set this to True to tell the driving agent to use your Q-Learning implementation.
- 'optimized' - Set this to True to tell the driving agent you are performing an optimized version of the Q-Learning implementation.

Additional flags that can be adjusted as part of optimizing the Q-Learning agent:

- 'n_test' - Set this to some positive number (previously 10) to perform that many testing trials.
- 'alpha' - Set this to a real number between 0 and 1 to adjust the learning rate of the Q-Learning algorithm.
- 'epsilon' - Set this to a real number between 0 and 1 to adjust the starting exploration factor of the Q-Learning algorithm.
- 'tolerance' - Set this to some small value larger than 0 (default was 0.05) to set the epsilon threshold for testing.

Furthermore, we should use a decaying function of our choice for $\epsilon$ (the exploration factor). Note that whichever function we use, it must decay to 'tolerance' at a reasonable rate; the Q-Learning agent will not begin testing until this occurs. Some example decaying functions (for $t$, the number of trials) are exponential decays such as $\epsilon = a^{t}$ or $\epsilon = e^{-at}$ (for $0 < a < 1$), or an inverse-power decay such as $\epsilon = \frac{1}{t^{2}}$.
You may also use a decaying function for $\alpha$ (the learning rate) if you so choose; however, this is typically less common. If you do so, be sure that it adheres to the inequality $0 \leq \alpha \leq 1$.
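As an illustration, here is how a couple of those decay choices (plus an optional $\alpha$ decay) could be wired into the per-trial reset; the constant a = 0.01 and the attribute names are assumptions, not the project's exact code:
# Hedged sketch of an epsilon decay inside the per-trial reset
import math

def reset(self, testing=False):
    self.t += 1                               # self.t counts completed trials (assumed attribute)
    if testing:
        self.epsilon = 0.0                    # pure exploitation during testing trials
        self.alpha = 0.0
    else:
        a = 0.01
        self.epsilon = math.exp(-a * self.t)  # exponential decay toward 'tolerance'
        # Alternative: self.epsilon = 1.0 / (self.t ** 2)
        # Optional learning-rate decay, kept within 0 <= alpha <= 1:
        # self.alpha = max(0.1, self.alpha * 0.99)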
# Load the 'sim_improved-learning' file from the improved Q-Learning simulation
! wget https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/smartcab/logs/sim_improved-learning.csv
vs.plot_trials('/content/sim_improved-learning.csv')
A quick glance at the visualizations shows considerable improvements in our driving agent over the default Q-Learning run.
Sometimes, the answer to the important question "what am I trying to get my agent to learn?" only has a theoretical answer and cannot be concretely described. Here, however, we can concretely define what the agent is trying to learn: the U.S. right-of-way traffic laws. Since these laws are known information, we can further define, for each state the Smartcab occupies, the optimal action for the driving agent based on those laws. In that case, we call the set of optimal state-action pairs an optimal policy. Hence, unlike some theoretical answers, it is clear whether the agent is acting "incorrectly" not only by the reward (penalty) it receives, but also by pure observation: if the agent drives through a red light, we not only see it receive a negative reward but also know that this is not the correct behavior. This can be used to our advantage for verifying whether the policy the driving agent has learned is the correct one, or a suboptimal one.
Let's check out a few examples (using the states we've defined) of what an optimal or sub-optimal policy for this problem would look like.
To recap, our defined state space was: (waypoint, inputs['light'], inputs['oncoming'], inputs['left'])
A few examples of an optimal policy for this problem, with respect to our state space, would be:
Looking at the entries in the sim_improved-learning.txt file, the policies are mostly correct for the given states.
For example, looking at the Q-table for the following state: ('right', 'red', 'forward', None), the correct action to take is indeed 'right', as discussed above (state 3 in the examples).
But there are cases where the recorded policy is in fact suboptimal. For example, looking at the Q-table for the following state: ('forward', 'green', 'forward', 'left'), the policy here is to turn right, while the optimal action would have been to follow the waypoint and move forward, since the car on the left looking to turn left will not move because of the traffic light (which will be red in its case).
So there are some examples where our agent has not learned the optimal policies in the given amount of training time.
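One way to make this kind of check systematic is to derive the greedy policy directly from a Q-table shaped like self.Q and compare each state's chosen action against the expected one. The Q-values below are made up purely for illustration and are not taken from the real log file:
# Derive the greedy policy from a self.Q-style dictionary (illustrative data only)
def greedy_policy(Q):
    # Map each state to its highest-valued action
    return {state: max(actions, key=actions.get) for state, actions in Q.items()}

# Made-up Q-values for the two states discussed above
Q = {
    ('right', 'red', 'forward', None):       {None: 0.1, 'forward': -5.2, 'left': -4.8, 'right': 1.7},
    ('forward', 'green', 'forward', 'left'): {None: -2.0, 'forward': 0.4, 'left': -3.1, 'right': 1.2},
}

for state, action in greedy_policy(Q).items():
    print(state, '->', action)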