pyRDDLGym-rl: Reinforcement Learning

pyRDDLGym-rl provides wrappers that allow deep reinforcement learning libraries (currently Stable Baselines 3 and RLlib) to be used with pyRDDLGym.

Requirements

This package requires Python 3.8+ and pyRDDLGym>=2.0, together with one of:

  • stable-baselines3>=2.2.1

  • ray[rllib]>=2.9.2

Installing via pip

You can install pyRDDLGym-rl and all of its requirements via pip:

pip install stable-baselines3  # need one of these two
pip install -U "ray[rllib]"
pip install rddlrepository pyRDDLGym-rl

Installing the Pre-Release Version via git

pip install git+https://github.com/pyrddlgym-project/pyRDDLGym-rl.git

Running the Basic Stable Baselines 3 Example

To run the stable-baselines3 example, navigate to the install directory of pyRDDLGym-rl, and type:

python -m pyRDDLGym_rl.examples.run_stable_baselines <domain> <instance> <method> <steps> <learning_rate>

where:

  • <domain> is the name of the domain in rddlrepository, or a path pointing to a domain.rddl file

  • <instance> is the name of the instance in rddlrepository, or a path pointing to an instance.rddl file

  • <method> is the RL algorithm to use [a2c, ddpg, dqn, ppo, sac, td3]

  • <steps> is the (optional) number of samples to generate from the environment for training

  • <learning_rate> is the (optional) learning rate to specify for the algorithm.
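For example, the following command trains PPO for 100000 steps with a learning rate of 0.001 (the CartPole_Continuous_gym domain and its instance 0 are assumed to be available in your rddlrepository installation; substitute any other domain and instance, and adjust the hyperparameters as needed):

python -m pyRDDLGym_rl.examples.run_stable_baselines CartPole_Continuous_gym 0 ppo 100000 0.001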

Running the Basic RLlib Example

To run the RLlib example, from the install directory of pyRDDLGym-rl, type:

python -m pyRDDLGym_rl.examples.run_rllib <domain> <instance> <method> <iters>

where:

  • <domain> is the name of the domain in rddlrepository, or a path pointing to a domain.rddl file

  • <instance> is the name of the instance in rddlrepository, or a path pointing to an instance.rddl file

  • <method> is the RL algorithm to use [dqn, ppo, sac]

  • <iters> is the (optional) number of iterations of training.
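For example, the following command runs 10 training iterations of PPO (again assuming the CartPole_Continuous_gym domain and instance 0 are available in your rddlrepository installation; the iteration count is illustrative):

python -m pyRDDLGym_rl.examples.run_rllib CartPole_Continuous_gym 0 ppo 10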

Running Stable Baselines 3 from the Python API

The following example sets up the Stable Baselines 3 PPO algorithm to work with pyRDDLGym:

from stable_baselines3 import PPO

import pyRDDLGym
from pyRDDLGym_rl.core.agent import StableBaselinesAgent
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv

# create the environment
env = pyRDDLGym.make("domain", "instance", base_class=SimplifiedActionRDDLEnv)

# train the PPO agent (pass additional arguments, such as learning rate, here)
steps = 100000  # total number of environment samples to train on
agent = PPO('MultiInputPolicy', env, verbose=1)
agent.learn(total_timesteps=steps)

# wrap the agent in a RDDL policy and evaluate
ppo_agent = StableBaselinesAgent(agent)
ppo_agent.evaluate(env, episodes=1, verbose=True, render=True)

env.close()

Running RLlib from the Python API

The following example sets up the RLlib PPO algorithm to work with pyRDDLGym:

from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig

import pyRDDLGym
from pyRDDLGym_rl.core.agent import RLLibAgent
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv

# set up the environment
def env_creator(cfg):
    return pyRDDLGym.make(cfg['domain'], cfg['instance'], base_class=SimplifiedActionRDDLEnv)
register_env('RLLibEnv', env_creator)

# create the agent
config = {'domain': "domain", 'instance': "instance"}
agent = PPOConfig().environment('RLLibEnv', env_config=config).build()

# train the agent
iters = 10  # number of training iterations
for _ in range(iters):
    print(agent.train()['episode_reward_mean'])

# wrap the agent in a RDDL policy and evaluate
env = env_creator(config)
ppo_agent = RLLibAgent(agent)
ppo_agent.evaluate(env, episodes=1, verbose=True, render=True)

env.close()

The Environment Wrapper

You can use the environment wrapper with your own RL implementation, or with a package that we do not currently support:

import pyRDDLGym
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv
env = pyRDDLGym.make("domain", "instance", base_class=SimplifiedActionRDDLEnv)
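For example, the wrapped environment follows the standard Gymnasium interface used by pyRDDLGym, so it can be rolled out with a random policy directly (a minimal sketch, continuing from the env created above and assuming the usual 5-tuple step signature):

# roll out one episode with uniformly random actions
state, _ = env.reset()
done = False
while not done:
    # sample an action from the simplified (aggregated) action space
    action = env.action_space.sample()
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated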

The goal of this wrapper is to simplify the action space as much as possible. To illustrate, the action space of the MarsRover domain is defined as:

Dict(
    'power-x___d1': Box(-0.1, 0.1, (1,), float32),
    'power-x___d2': Box(-0.1, 0.1, (1,), float32),
    'power-y___d1': Box(-0.1, 0.1, (1,), float32),
    'power-y___d2': Box(-0.1, 0.1, (1,), float32),
    'harvest___d1': Discrete(2), 'harvest___d2': Discrete(2)
)

However, the action space of the wrapper simplifies to

Dict(
    'discrete': MultiDiscrete([2 2]),
    'continuous': Box(-0.1, 0.1, (4,), float32)
)

where the discrete and continuous action variable components have been aggregated. Actions provided to the environment must therefore follow this form, i.e. they must be a dictionary in which the discrete field is assigned a (2,) array of integer type and the continuous field is assigned a (4,) array of float type.
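As a concrete illustration, the following sketch builds such an action by hand and passes it to the environment. Here env is assumed to be a wrapped MarsRover environment created as above; the numeric values are arbitrary, and the ordering of entries within each aggregated array is assumed to follow the wrapper's internal ordering of the underlying action variables.

import numpy as np

# assumes env is a wrapped MarsRover environment, as created above
state, _ = env.reset()

# an explicit action for the simplified action space shown above
action = {
    'discrete': np.array([1, 0]),                                       # aggregated boolean actions (harvest___d1, harvest___d2)
    'continuous': np.array([0.05, -0.02, 0.0, 0.1], dtype=np.float32)   # aggregated power-x/power-y actions
}
state, reward, terminated, truncated, info = env.step(action)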

Note

The vectorized option is required by the wrapper and is automatically set to True.

Warning

The action simplification rules apply the max-nondef-actions constraint only to boolean actions, and assume its value is either 1 or at least the total number of boolean actions. Any other setting is currently not supported by pyRDDLGym-rl and will raise an exception.

Limitations

We note several limitations of pyRDDLGym-rl:

  • The action space required by the Stable Baselines 3/RLlib algorithm must be compatible with the action space produced by pyRDDLGym (e.g. DQN only handles Discrete spaces).

  • Only special types of constraints on boolean actions are supported (as described above).