In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

Key Concepts and Terminology

States and Observations

A state $s$ is a complete description of the state of the world. There is no information about the world hidden from the state.

An observation $o$ is a partial description of a state, which may omit information.

When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.
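As a minimal sketch of the difference (a hypothetical toy 1D world, not a real library API): the full state contains both position and velocity, while a partial observation hides the velocity from the agent.

import numpy as np

# Complete state of a toy 1D world: position and velocity.
state = {"position": np.array([0.3]), "velocity": np.array([-1.2])}

def observe_full(state):
	# Fully observed: the agent sees the complete state.
	return np.concatenate([state["position"], state["velocity"]])

def observe_partial(state):
	# Partially observed: velocity is hidden from the agent.
	return state["position"]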

Action Spaces

The action space is the set of all valid actions $a$ in a given environment.
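For a concrete example (assuming the Gymnasium library is installed), each environment exposes its action space directly; CartPole's two push actions form a discrete space, while Pendulum's torque forms a continuous one.

import gymnasium as gym

# Discrete action space: two valid actions (push cart left or right).
print(gym.make("CartPole-v1").action_space)   # Discrete(2)

# Continuous action space: one torque value in [-2, 2].
print(gym.make("Pendulum-v1").action_space)   # Box(-2.0, 2.0, (1,), float32)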

Policy

A policy $\pi_\theta(a|s)$ is the rule an agent uses to decide which actions to take, where $\theta$ denotes the policy's parameters. We can think of it as the brain of the agent, and in many cases the words "agent" and "policy" can be used interchangeably.

a. Deterministic policies: $a_t = \mu_\theta(s_t)$

e.g., a feed-forward MLP (multi-layer perceptron):


import torch.nn as nn

# Example dimensions (hypothetical; set these from your environment).
obs_dim, act_dim = 8, 2

# A deterministic policy as a feed-forward MLP: observation in, action out.
pi_net = nn.Sequential(
	nn.Linear(obs_dim, 64),   # observation vector -> 64 hidden units
	nn.Tanh(),
	nn.Linear(64, 64),
	nn.Tanh(),
	nn.Linear(64, act_dim),   # 64 hidden units -> action vector
)

Here obs_dim is the dimensionality of the observation space and act_dim is the dimensionality of the action space; both are integers that size the network's input and output layers.
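As a usage sketch (the batch below is a hypothetical placeholder): given a NumPy array obs containing a batch of observations, the corresponding batch of actions comes from converting obs to a tensor and calling pi_net.

import numpy as np
import torch

# Hypothetical batch of 32 observations.
obs = np.random.rand(32, obs_dim).astype(np.float32)
obs_tensor = torch.as_tensor(obs, dtype=torch.float32)
actions = pi_net(obs_tensor)  # shape: (32, act_dim)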