Tuning policy hyper-parameters in JaxPlan

In this advanced notebook, we illustrate how to tune hyper-parameters of the policy and planner efficiently using Bayesian optimization, such as the topology of the policy network, the learning rate, and the model relaxations.

Start by installing the required packages. Notice that we install pyRDDLGym-jax with the [extra] option so that the required bayesian-optimization package is included:

%pip install --quiet --upgrade pip
%pip install --quiet pandas pyRDDLGym rddlrepository pyRDDLGym-jax[extra]
Note: you may need to restart the kernel to use updated packages.

Import the required packages:

import os
import pandas as pd
from IPython.display import Image
from IPython.utils import io
import matplotlib.pyplot as plt
import math

import pyRDDLGym
from pyRDDLGym.core.visualizer.movie import MovieGenerator
from pyRDDLGym_jax.core.planner import JaxDeepReactivePolicy, JaxBackpropPlanner, JaxOfflineController, load_config_from_string
from pyRDDLGym_jax.core.tuning import JaxParameterTuning, Hyperparameter
C:\Python\envs\rddl\Lib\site-packages\pyRDDLGym\core\debug\exception.py:28: UserWarning: cv2 is not installed: save_as_mp4 option will be disabled.
  warnings.warn(message)

We will use the MountainCar control problem as our example:

env = pyRDDLGym.make('MountainCar_Continuous_gym', '0', vectorized=True)

In order to instruct the tuner to tune specific parameters, we need to create a config file with placeholders for the tunable parameters. In this case, we let TUNE_LR denote the learning rate, TUNE_L1 and TUNE_L2 denote the layer sizes of the policy network, and TUNE_WEIGHT denote the model relaxation parameter:

config = """
[Model]
comparison_kwargs={'weight': TUNE_WEIGHT}
rounding_kwargs={'weight': TUNE_WEIGHT}
control_kwargs={'weight': TUNE_WEIGHT}
[Optimizer]
method='JaxDeepReactivePolicy'
method_kwargs={'topology': [TUNE_L1, TUNE_L2]}
optimizer_kwargs={'learning_rate': TUNE_LR}
batch_size_train=1
batch_size_test=1
[Training]
train_seconds=15
print_summary=False
print_progress=False
"""

Next, we need to connect each selected hyper-parameter to the range the Bayesian optimizer searches over, together with a transformation that maps the raw searched value to the value substituted into the config:

hyperparams = [
    Hyperparameter('TUNE_WEIGHT', 0., 15., math.exp2),
    Hyperparameter('TUNE_LR', -15., 0., math.exp2),
    Hyperparameter('TUNE_L1', 2, 64, math.floor),
    Hyperparameter('TUNE_L2', 2, 64, math.floor)
]    
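The last argument of each Hyperparameter is the transformation applied to the raw value suggested by the optimizer before it is substituted into the config: the weight and learning rate are searched in log2 space, and the layer sizes are rounded down to integers. A quick sanity check of the effective ranges (not part of the tuning API, just standard math calls):

print(math.exp2(0.), math.exp2(15.))   # TUNE_WEIGHT spans 1.0 ... 32768.0
print(math.exp2(-15.), math.exp2(0.))  # TUNE_LR spans ~3.1e-05 ... 1.0
print(math.floor(2), math.floor(64))   # TUNE_L1, TUNE_L2 span 2 ... 64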

Next, we instantiate the tuning instance for the deep reactive policy (DRP). This will tune the topology of the policy network, the learning rate, and the model relaxation weight. We specify one independent evaluation run per trial, 4 parallel workers, and 10 Bayesian optimization iterations, for a total of 40 hyper-parameter configurations tested:

tuning = JaxParameterTuning(env=env, config_template=config, hyperparams=hyperparams, online=False,
                            eval_trials=1, num_workers=4, gp_iters=10)

Finally, we launch the tuning instance, specifying the RNG key and the CSV file where the tuning log will be saved. Warning: this will take a while:

with io.capture_output():
    best_params = tuning.tune(key=42, log_file='mountaincar_tuning.csv')
print(best_params)
{'TUNE_WEIGHT': 10.104708040704434, 'TUNE_LR': 0.00035041832766298753, 'TUNE_L1': 61, 'TUNE_L2': 12}

The outputs of the Bayesian optimization have now been saved to a CSV file. Let's use pandas to read this file:

gp_data = pd.read_csv('mountaincar_tuning.csv')
gp_data.head(5)
pid worker iteration target best_target acq_params TUNE_WEIGHT TUNE_LR TUNE_L1 TUNE_L2
0 31252 0 0 -20.000000 91.871464 {'random_state': RandomState(MT19937) at 0x233... 504.908188 0.061636 25 60
1 19792 1 0 -0.000060 91.871464 {'random_state': RandomState(MT19937) at 0x233... 8150.326768 0.000056 11 11
2 30408 2 0 -0.000059 91.871464 {'random_state': RandomState(MT19937) at 0x233... 23965.107358 0.000038 39 45
3 28576 3 0 91.871464 91.871464 {'random_state': RandomState(MT19937) at 0x233... 6.732152 0.000202 53 15
4 13564 0 1 88.120588 91.871464 {'random_state': RandomState(MT19937) at 0x233... 7.676418 0.001000 62 11
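Each row records one trial: the worker process, the tuning iteration, the objective value achieved (target), the running best objective (best_target), and the transformed hyper-parameter values. For example, the single best trial can be pulled directly out of the log with pandas (using the column names shown above):

best_row = gp_data.loc[gp_data['target'].idxmax()]
print(best_row[['target', 'TUNE_WEIGHT', 'TUNE_LR', 'TUNE_L1', 'TUNE_L2']])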

Let’s plot the target values across iterations:

%matplotlib inline
plt.plot(gp_data['target'])
plt.show()
(Plot: objective value of each trial across tuning iterations.)
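It can also be informative to overlay the running best value from the best_target column, which shows how quickly the optimizer converges (plain pandas/matplotlib on the same log):

plt.plot(gp_data['target'], label='trial objective')
plt.plot(gp_data['best_target'], label='running best')
plt.xlabel('trial')
plt.ylabel('objective')
plt.legend()
plt.show()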

The config file corresponding to the best hyper-parameters is:

print(tuning.best_config)
[Model]
comparison_kwargs={'weight': 10.104708040704434}
rounding_kwargs={'weight': 10.104708040704434}
control_kwargs={'weight': 10.104708040704434}
[Optimizer]
method='JaxDeepReactivePolicy'
method_kwargs={'topology': [61, 12]}
optimizer_kwargs={'learning_rate': 0.00035041832766298753}
batch_size_train=1
batch_size_test=1
[Training]
train_seconds=15
print_summary=False
print_progress=False
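Since tuning.best_config is just a config string, it can be written to disk and reloaded later with load_config_from_string, without rerunning the tuner (plain file I/O; the file name here is arbitrary):

with open('mountaincar_best.cfg', 'w') as f:
    f.write(tuning.best_config)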

Let’s load this config into a planner and evaluate it:

planner_args, _, train_args = load_config_from_string(tuning.best_config)
planner = JaxBackpropPlanner(rddl=env.model, **planner_args)
agent = JaxOfflineController(planner, **train_args)

if not os.path.exists('frames'):
    os.makedirs('frames')
recorder = MovieGenerator("frames", "mountaincar", max_frames=env.horizon)
env.set_visualizer(viz=None, movie_gen=recorder)
agent.evaluate(env, episodes=1, render=True)
env.close()
Image(filename='frames/mountaincar_0.gif') 
(Animation: rollout of the tuned policy in the MountainCar environment.)
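A single rendered rollout only gives a qualitative picture. To obtain a quantitative estimate of performance, the controller can also be evaluated over several episodes without rendering (this assumes, as in pyRDDLGym, that evaluate returns summary statistics of the return; we recreate the environment since it was closed above):

env = pyRDDLGym.make('MountainCar_Continuous_gym', '0', vectorized=True)
stats = agent.evaluate(env, episodes=20)
print(stats)
env.close()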