Training a PPO policy with rllib

We demonstrate how to train a PPO policy using the rllib package.

First install and import the required packages:

%pip install --quiet --upgrade pip
%pip install --quiet torch
%pip install --quiet -U "ray[rllib]==2.37.0"
%pip install --quiet rddlrepository pyRDDLGym-rl
Note: you may need to restart the kernel to use updated packages.

Import the required packages:

import warnings
warnings.filterwarnings("ignore")
import os
from IPython.display import Image

from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig

import pyRDDLGym
from pyRDDLGym.core.visualizer.movie import MovieGenerator

from pyRDDLGym_rl.core.agent import RLLibAgent
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv

We will optimize the Reservoir control problem from the 2023 International Probabilistic Planning Competition. The environment creation has to be wrapped in a factory function registered with RLlib, and the observation space needs to be flattened, which the SimplifiedActionRDDLEnv base class takes care of:

def env_creator(env_config):
    return pyRDDLGym.make(env_config['domain'], env_config['instance'], base_class=SimplifiedActionRDDLEnv) 
register_env('RLLibEnv', env_creator)
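
As a quick sanity check, you can instantiate the wrapped environment once and inspect its spaces before handing it over to RLlib. This is an optional, illustrative step; it assumes the Reservoir_ippc2023 domain used below is available through rddlrepository:

# optional: inspect the flattened observation and simplified action spaces
check_env = env_creator({'domain': 'Reservoir_ippc2023', 'instance': '1'})
print(check_env.observation_space)
print(check_env.action_space)
check_env.close()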

Let’s set up and train a PPO agent:

# set up the agent
config = PPOConfig()
config = config.env_runners(num_env_runners=1, num_envs_per_env_runner=8)
config = config.environment('RLLibEnv', env_config={'domain': 'Reservoir_ippc2023', 'instance': '1'})
config = config.training(train_batch_size_per_learner=64, lr=0.0003, gamma=0.98, lambda_=0.5)
algo = config.build()

# train the agent
for n in range(100):
    result = algo.train()
    if n % 10 == 0: print(f'iteration {n}, mean return {result["env_runners"]["episode_reward_mean"]}')
iteration 0, mean return -42246.37854747662
iteration 10, mean return -34049.59773807312
iteration 20, mean return -25318.256660621093
iteration 30, mean return -7515.10995020135
iteration 40, mean return -1861.6146227081479
iteration 50, mean return -1152.288113156062
iteration 60, mean return -1711.4167606127955
iteration 70, mean return -608.5802914772291
iteration 80, mean return -695.5039031179475
iteration 90, mean return -2702.734062076777
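
Optionally, the trained policy can be checkpointed so it does not have to be retrained later. Below is a minimal sketch using RLlib's checkpoint API; the exact return value of save() differs between Ray releases (a path string in some, a checkpoint object in others), so treat this as illustrative:

# optional: checkpoint the trained algorithm (return type of save() varies across Ray releases)
checkpoint = algo.save()
print('saved checkpoint:', checkpoint)

# to reload later without retraining (illustrative):
# from ray.rllib.algorithms.algorithm import Algorithm
# restored_algo = Algorithm.from_checkpoint(checkpoint)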

To evaluate the trained agent, we wrap it in an RLLibAgent wrapper, which is an instance of pyRDDLGym’s BaseAgent:

agent = RLLibAgent(algo)

Lastly, we evaluate the agent as usual, recording the rollout as an animated GIF:

# for recording movies
if not os.path.exists('frames'):
    os.makedirs('frames')
env = env_creator({'domain': 'Reservoir_ippc2023', 'instance': '1'})
recorder = MovieGenerator("frames", "reservoir_rllib", max_frames=env.horizon)
env.set_visualizer(viz=None, movie_gen=recorder)

print(agent.evaluate(env, episodes=1, verbose=False, render=True))
env.close()
Image(filename='frames/reservoir_rllib_0.gif') 
{'mean': np.float64(-152.39982016871494), 'median': np.float64(-152.39982016871494), 'min': np.float64(-152.39982016871494), 'max': np.float64(-152.39982016871494), 'std': np.float64(0.0)}
(Rendered output: frames/reservoir_rllib_0.gif, an animation of the trained policy controlling the reservoir environment.)
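
When you are finished, you can optionally release the RLlib workers and shut down the local Ray instance that was started during training. A small cleanup sketch:

# optional cleanup: release RLlib workers and shut down the local Ray instance
import ray
algo.stop()
ray.shutdown()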