pyRDDLGym-rl: Reinforcement Learning

pyRDDLGym-rl provides wrappers that allow deep reinforcement learning libraries (currently Stable Baselines 3 and RLlib) to be used with pyRDDLGym.

Requirements

This package requires Python 3.8+ and pyRDDLGym>=2.0, together with one of:

  • stable-baselines3>=2.2.1

  • ray[rllib]>=2.9.2

Installing via pip

You can install pyRDDLGym-rl and all of its requirements via pip:

pip install stable-baselines3  # need one of these two
pip install -U "ray[rllib]"
pip install rddlrepository pyRDDLGym-rl

Installing the Pre-Release Version via git

pip install git+https://github.com/pyrddlgym-project/pyRDDLGym-rl.git

Running the Basic Stable Baselines 3 Example

To run the stable-baselines3 example, navigate to the install directory of pyRDDLGym-rl, and type:

python -m pyRDDLGym_rl.examples.run_stable_baselines <domain> <instance> <method> <steps> <learning_rate>

where:

  • <domain> is the name of the domain in rddlrepository, or a path pointing to a domain.rddl file

  • <instance> is the name of the instance in rddlrepository, or a path pointing to an instance.rddl file

  • <method> is the RL algorithm to use [a2c, ddpg, dqn, ppo, sac, td3]

  • <steps> is the (optional) number of samples to generate from the environment for training

  • <learning_rate> is the (optional) learning rate to specify for the algorithm.
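For example, the following command trains PPO for 100000 steps with a learning rate of 0.001 (the CartPole_Continuous_gym domain and its instance 0 are assumed to be available in your rddlrepository installation; substitute any other domain and instance, and adjust the hyperparameters as needed):

python -m pyRDDLGym_rl.examples.run_stable_baselines CartPole_Continuous_gym 0 ppo 100000 0.001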

Running the Basic RLlib Example

To run the RLlib example, from the install directory of pyRDDLGym-rl, type:

python -m pyRDDLGym_rl.examples.run_rllib <domain> <instance> <method> <iters>

where:

  • <domain> is the name of the domain in rddlrepository, or a path pointing to a domain.rddl file

  • <instance> is the name of the instance in rddlrepository, or a path pointing to an instance.rddl file

  • <method> is the RL algorithm to use [dqn, ppo, sac]

  • <iters> is the (optional) number of iterations of training.
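For example, the following command runs 10 training iterations of PPO (again assuming the CartPole_Continuous_gym domain and instance 0 are available in your rddlrepository installation; the iteration count is illustrative):

python -m pyRDDLGym_rl.examples.run_rllib CartPole_Continuous_gym 0 ppo 10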

Running Stable Baselines 3 from the Python API

The following example sets up the Stable Baselines 3 PPO algorithm to work with pyRDDLGym:

from stable_baselines3 import PPO

import pyRDDLGym
from pyRDDLGym_rl.core.agent import StableBaselinesAgent
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv

# create the environment
env = pyRDDLGym.make("domain", "instance", base_class=SimplifiedActionRDDLEnv)

# train the PPO agent (pass additional arguments, such as learning rate, here)
steps = 100000  # total number of environment samples to train on
agent = PPO('MultiInputPolicy', env, verbose=1)
agent.learn(total_timesteps=steps)

# wrap the agent in a RDDL policy and evaluate
ppo_agent = StableBaselinesAgent(agent)
ppo_agent.evaluate(env, episodes=1, verbose=True, render=True)

env.close()

Running RLlib from the Python API

The following example sets up the RLlib PPO algorithm to work with pyRDDLGym:

from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig

import pyRDDLGym
from pyRDDLGym_rl.core.agent import RLLibAgent
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv

# set up the environment
def env_creator(cfg):
    return pyRDDLGym.make(cfg['domain'], cfg['instance'], base_class=SimplifiedActionRDDLEnv)
register_env('RLLibEnv', env_creator)

# create the agent
config = {'domain': "domain", 'instance': "instance"}
agent = PPOConfig().environment('RLLibEnv', env_config=config).build()

# train the agent
iters = 10  # number of training iterations
for _ in range(iters):
    print(agent.train()['episode_reward_mean'])

# wrap the agent in a RDDL policy and evaluate
env = env_creator(config)
ppo_agent = RLLibAgent(agent)
ppo_agent.evaluate(env, episodes=1, verbose=True, render=True)

env.close()

The Environment Wrapper

You can use the environment wrapper with your own RL implementation, or with a package that we do not currently support:

import pyRDDLGym
from pyRDDLGym_rl.core.env import SimplifiedActionRDDLEnv
env = pyRDDLGym.make("domain", "instance", base_class=SimplifiedActionRDDLEnv)
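For example, the wrapped environment follows the standard Gymnasium interface used by pyRDDLGym, so it can be rolled out with a random policy directly (a minimal sketch, continuing from the env created above and assuming the usual 5-tuple step signature):

# roll out one episode with uniformly random actions
state, _ = env.reset()
done = False
while not done:
    # sample an action from the simplified (aggregated) action space
    action = env.action_space.sample()
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated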

The goal of this wrapper is to simplify the action space as much as possible. To illustrate, the action space of the MarsRover domain is defined as:

Dict(
    'power-x___d1': Box(-0.1, 0.1, (1,), float32),
    'power-x___d2': Box(-0.1, 0.1, (1,), float32),
    'power-y___d1': Box(-0.1, 0.1, (1,), float32),
    'power-y___d2': Box(-0.1, 0.1, (1,), float32),
    'harvest___d1': Discrete(2), 'harvest___d2': Discrete(2)
)

However, the action space of the wrapper simplifies to

Dict(
    'discrete': MultiDiscrete([2 2]),
    'continuous': Box(-0.1, 0.1, (4,), float32)
)

where the discrete and continuous action variable components have been aggregated. Actions provided to the environment must therefore follow this form, i.e. they must be a dictionary in which the discrete field is assigned a (2,) array of integer type and the continuous field is assigned a (4,) array of float type.
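As a concrete illustration, the following sketch builds such an action by hand and passes it to the environment. Here env is assumed to be a wrapped MarsRover environment created as above; the numeric values are arbitrary, and the ordering of entries within each aggregated array is assumed to follow the wrapper's internal ordering of the underlying action variables.

import numpy as np

# assumes env is a wrapped MarsRover environment, as created above
state, _ = env.reset()

# an explicit action for the simplified action space shown above
action = {
    'discrete': np.array([1, 0]),                                       # aggregated boolean actions (harvest___d1, harvest___d2)
    'continuous': np.array([0.05, -0.02, 0.0, 0.1], dtype=np.float32)   # aggregated power-x/power-y actions
}
state, reward, terminated, truncated, info = env.step(action)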

Note

The vectorized option is required by the wrapper and is automatically set to True.

Warning

The action simplification rules apply the max-nondef-actions constraint only to boolean actions, and assume its value is either 1 or at least the total number of boolean actions. Any other setting is currently not supported by pyRDDLGym-rl and will raise an exception.

Limitations

We note several limitations of pyRDDLGym-rl:

  • The action space required by the Stable Baselines 3/RLlib algorithm must be compatible with the action space produced by pyRDDLGym (e.g. DQN only handles Discrete spaces).

  • Only special types of constraints on boolean actions are supported (as described above).