Keen Software House has improved its general artificial intelligence. It could already handle Pong, a Breakout-style game, and an Artificial Brain simulator, and now it is able to work with delayed rewards and create a hierarchy of goals.
The AI was able to control movement through a maze-like map and learn the rules of a game.
It used reinforcement learning to seek rewards and avoid punishment.
The AI is able to follow a complex chain of strategies in order to complete its main goal. It can arrange its various goals into a hierarchy and plan ahead so it reaches an even bigger goal.
How the algorithm works
The brain we have implemented for this milestone is based on a combination of a hierarchical Q-learning algorithm and a motivation model which is able to switch between different strategies in order to reach a complex goal. The Q-learning algorithm is more specifically known as HARM, or Hierarchical Action Reinforcement Motivation system.
In a nutshell, the Q-learning algorithm (HARM) is able to spread a reward given in a specific state (e.g. the agent reaching a position on the map) to the surrounding space, so the brain can take the proper action by climbing the steepest gradient of the Q function. However, if the goal state is far away from the current state, it might take a long time to build a strategy that will lead to that goal state. Also, the number of variables in the environment can lead to extremely long routes through the “state space”, rendering the problem almost unsolvable.
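The reward-spreading idea can be sketched with plain tabular Q-learning on a toy one-dimensional corridor. This is an illustrative sketch, not the actual HARM implementation; the environment, constants, and update rule below are standard textbook Q-learning:

```python
import random

random.seed(0)  # make the run reproducible

N_STATES = 6          # corridor cells 0..5; reward is given at the last cell
ACTIONS = [-1, +1]    # primitive actions: step left, step right
ALPHA, GAMMA = 0.5, 0.9

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy exploration
        if random.random() < 0.2:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # This update "spreads" the reward backwards through the state space:
        Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# After learning, greedy action selection climbs the steepest gradient of Q:
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)]
print(policy)  # every cell points toward the reward: [1, 1, 1, 1, 1]
```

The "long routes" problem mentioned above is visible even here: the reward has to trickle back one state per visit, so distant goals take many episodes to reach reliably.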
There are several ideas that can improve the overall performance of the algorithm. First, we made the agent reward itself for any successful change to the environment. The motivation value can be assigned to each variable change so the agent is constantly motivated to change its surroundings.
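This self-rewarding mechanism might look something like the following sketch, where the agent grants itself an intrinsic reward whenever a tracked environment variable changes. The function name, variable names, and motivation values are illustrative assumptions, not the actual Keen Software House code:

```python
def intrinsic_reward(prev_state: dict, new_state: dict, motivation: dict) -> float:
    """Reward the agent for any change it caused to the environment.

    `motivation` maps a variable name to how strongly the agent currently
    wants that variable to change (hypothetical; not the real HARM code).
    Variables without an explicit motivation still yield a small default
    reward, so the agent stays motivated to change its surroundings.
    """
    reward = 0.0
    for var, old_value in prev_state.items():
        if new_state.get(var) != old_value:
            reward += motivation.get(var, 1.0)
    return reward

# Example: the agent opened a door while its position also changed.
before = {"door_open": False, "agent_x": 3}
after_ = {"door_open": True, "agent_x": 4}
print(intrinsic_reward(before, after_, {"door_open": 5.0}))  # 6.0 (5.0 + default 1.0)
```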
Second, the brain can develop a set of abstract actions assigned to any type of change that is possible (e.g. changing the state of a door) and can build an underlying strategy for how this change can be made. With such knowledge, the whole hierarchy of Q functions can be created. Third, in order to lower the complexity of the problem, the brain can analyze its “experience buffer” from the past and eventually drop variables that are not affected by its actions or are not necessary for the current goal (i.e. strategy to fulfill the goal).
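The third idea, dropping irrelevant variables, can be sketched as a simple pass over the experience buffer. Here a variable is kept only if its value ever changed across the recorded states; this is a simplified stand-in for HARM's actual pruning rules, and all names are illustrative:

```python
def relevant_variables(experience: list) -> set:
    """Keep only variables that ever changed across the experience buffer.

    `experience` is a list of observed state dicts. Variables whose value
    never varied are assumed unaffected by the agent's actions and dropped,
    shrinking the state space the Q functions must cover.
    """
    if not experience:
        return set()
    keep = set()
    first = experience[0]
    for state in experience[1:]:
        for var, value in state.items():
            if value != first.get(var):
                keep.add(var)
    return keep

buffer = [
    {"agent_x": 0, "door_open": False, "sun_up": True},
    {"agent_x": 1, "door_open": False, "sun_up": True},
    {"agent_x": 2, "door_open": True,  "sun_up": True},
]
print(sorted(relevant_variables(buffer)))  # ['agent_x', 'door_open'] -- 'sun_up' dropped
```

Discarding a constant variable like `sun_up` halves the number of states the agent has to explore for every such variable removed, which is exactly why this lowers the complexity of the problem.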
A mixture of these improvements creates a hierarchical decision model that is built during the exploration phase of learning (the agent is left to randomly explore the environment). After a sufficient amount of knowledge is gathered, we can “order” the agent to fulfill a goal by manually raising the motivation value for a selected variable. The agent will then execute the learned abstract action (strategy) by traversing the strategy tree and unrolling it into a chain of primitive actions that lie at the bottom.
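The final unrolling step can be pictured as a depth-first traversal. The tree below is a hand-written hypothetical example; in HARM the hierarchy is learned during exploration rather than authored like this:

```python
# Hypothetical learned hierarchy: abstract actions map to sub-strategies.
strategies = {
    "reach_goal": ["open_door", "walk_to_goal"],
    "open_door":  ["walk_to_door", "press_switch"],
}
PRIMITIVES = {"walk_to_door", "press_switch", "walk_to_goal"}

def unroll(action: str) -> list:
    """Traverse the strategy tree depth-first down to primitive actions."""
    if action in PRIMITIVES:
        return [action]
    plan = []
    for sub in strategies[action]:
        plan.extend(unroll(sub))
    return plan

# Raising the motivation for "reach_goal" triggers the learned strategy:
print(unroll("reach_goal"))  # ['walk_to_door', 'press_switch', 'walk_to_goal']
```

The chain of primitive actions at the bottom is what the agent actually executes in the environment, one step at a time.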