Theory Behind

Building Snake 3D videogame

As a brief intro, we are going to explain how we turned the classic Snake into the Snake 3D that you may have previously seen, covering our implementation of Snake 3D's logic and how we managed to achieve this recreation.


Building the game was a fairly simple part, because all we needed to do was follow the logic of 2D Snake and transform that base logic into a three-dimensional space. Building the graphics was not complicated either, because we used the three.js library, which lets us make whatever kind of 3D model we want in a few lines of code. Also, as you have seen, the graphics are kept minimal.
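
To give an idea of how little three.js code this takes, here is a minimal sketch that renders a single snake segment as a cube on the grid. It is only an illustration under our own assumptions (the addSegment helper and the camera placement are hypothetical, not taken from the actual source of the game).

```js
import * as THREE from 'three';

// Minimal three.js sketch (hypothetical names, not the game's actual source):
// render one snake segment as a unit cube at an integer grid position.
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(60, window.innerWidth / window.innerHeight, 0.1, 100);
camera.position.set(8, 8, 8);
camera.lookAt(0, 0, 0);

const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

function addSegment(x, y, z) {
  const geometry = new THREE.BoxGeometry(1, 1, 1);
  const material = new THREE.MeshNormalMaterial();
  const cube = new THREE.Mesh(geometry, material);
  cube.position.set(x, y, z);
  scene.add(cube);
  return cube;
}

addSegment(0, 0, 0); // head of the snake at the origin

function animate() {
  requestAnimationFrame(animate);
  renderer.render(scene, camera);
}
animate();
```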

We started by creating our Snake class, which you can see by clicking below.

The main methods the snake has are move(), changeDirection(), and checkSnakeState(). The move() method allows the snake to move smoothly around the space according to the direction that was selected. It involves five main attributes: the speed, the current direction (including the direction of each part of the tail), the chosen direction, the snake's position, and the translation error. The movement logic also prevents a direction change when the snake is not at a valid position for the change, so the snake can only move according to the grid. When the position is valid, the current direction is changed to the chosen one, and the user controls are remapped as well, because the direction the user chooses is based on their own perspective and not on the perspective of the entire environment.
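
As a rough sketch of that idea, and not our exact implementation (attribute names, the alignment check, and the numbers here are illustrative assumptions), the head advances by speed along the current direction each frame, and a turn is only accepted when the head is aligned with a grid cell:

```js
// Illustrative sketch of the movement logic; names and tolerances are assumptions.
class Snake {
  constructor() {
    this.speed = 0.05;                       // units moved per frame
    this.position = { x: 0, y: 0, z: 0 };    // head position
    this.currentDirection = { x: 1, y: 0, z: 0 };
    this.tailDirections = [];                // direction stored for each tail part
    this.translationError = 1e-3;            // tolerance for grid alignment
  }

  // The snake may only turn when the head sits (almost) exactly on a grid cell.
  onGridCell() {
    const near = (v) => Math.abs(v - Math.round(v)) < this.translationError;
    return near(this.position.x) && near(this.position.y) && near(this.position.z);
  }

  changeDirection(newDirection) {
    if (!this.onGridCell()) return;          // not a valid position to turn
    this.currentDirection = newDirection;    // accept the chosen direction
    // ...here the user controls would also be remapped to the new perspective.
  }

  move() {
    this.position.x += this.currentDirection.x * this.speed;
    this.position.y += this.currentDirection.y * this.speed;
    this.position.z += this.currentDirection.z * this.speed;
    // ...each tail part follows the direction stored for it in tailDirections.
  }
}
```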

The Apple object is a very simple class that you can inspect by clicking below.

Every time the apple is captured it is deleted and re-instantiated at a random position, avoiding the snake's position and the obstacles' positions.
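
A sketch of that respawn logic, assuming positions are integer grid cells and using helper names that are not from our real code, could look like this:

```js
// Hypothetical sketch: respawn the apple on a free cell of the grid.
// Assumes at least one cell is unoccupied.
function randomApplePosition(gridSize, snakeCells, obstacleCells) {
  const occupied = new Set(
    [...snakeCells, ...obstacleCells].map((c) => `${c.x},${c.y},${c.z}`)
  );
  let cell;
  do {
    cell = {
      x: Math.floor(Math.random() * gridSize.x),
      y: Math.floor(Math.random() * gridSize.y),
      z: Math.floor(Math.random() * gridSize.z),
    };
  } while (occupied.has(`${cell.x},${cell.y},${cell.z}`));
  return cell;
}
```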

We also implemented obstacles: if the snake ends up in the same position as any obstacle, it loses immediately. You can see the implementation by clicking below.
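
The losing condition reduces to a position check of the kind sketched here (again an illustrative version, not the code behind the button):

```js
// Illustrative check: the snake loses when its head lands on an obstacle cell.
function hitsObstacle(headCell, obstacleCells) {
  return obstacleCells.some(
    (o) => o.x === headCell.x && o.y === headCell.y && o.z === headCell.z
  );
}
```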


May 25th 2020


We are two students from CETYS University, currently in our fourth semester studying computational science. This platform was the final project for a computer graphics course, but we wanted to develop more than a purely graphics-oriented project. On a side note, we are currently reading about temporal difference learning, so it would not be a surprise if that method for solving finite Markov Decision Processes appears on this platform in the future.

Developer Team:

Gerardo Hernández Meneses

AI and Game Programmer

Pablo Diaz Ochoa

Game Programmer

Building AI

Here we explain how we applied one of the most common introductory methods in reinforcement learning, a Monte Carlo method, to solve a goal-directed learning task. All the knowledge applied here was taken from the book shown below.

[image: cover of the referenced book]

The main point of most of the methods developed in the reinforcement learning field is that an agent must be capable of learning the dynamics of the environment it confronts based on its interaction with that environment. This is one of the three paradigms of machine learning.

The Snake 3D AI scenario yields a situation that involves taking a decision (action) based on the situation (state) the learning agent is currently in, being aware that each action the agent takes can lead to different future states and different signals (rewards). This gives us an overview of the Markov Decision Process (MDP), a mathematical framework describing general equations for this problem of sequential interaction over discrete time steps. An MDP is an abstraction of goal-directed learning with only three main components: states \(S\), actions \(A\), and rewards \(R\). A sequential interaction can be described as \(s_0,a_0,r_1,s_1,a_1,r_2,\dots, s_{T-1},a_{T-1},r_{T}\), where \(s_i \in S\), \(a_i \in A\), \(r_i \in R\), and \(T\) is the time step of the terminal (absorbing) state.

In this case we have a finite Markov Decision Process, which means the number of states the learning agent can confront is finite; the same characteristic applies to the set of actions the agent can perform and to the set of rewards it can obtain. As a way to solve this finite MDP we preferred to implement Monte Carlo methods, even though other alternatives such as temporal difference learning could also apply.

Monte Carlo methods are a way to solve the sequential action-selection problem we have previously described when we do not have the state-transition probabilities. One of the best attributes of Monte Carlo methods is that they only need sample experience to learn; they learn at the end of each episode and are based on the MDP framework. Therefore we use the sets of states, actions, and rewards to build these experience samples. First we needed to determine the value of an action \(a_i\) in a certain state \(s_i\), given by the expression \(q(s,a)\). In this case Monte Carlo methods estimate \(q_*(s,a)\) based on the expected return, which is the average of all the returns the agent has obtained. The return itself is given by \[ G_t=\sum_{k=t+1}^{T}\gamma^{k-t-1} R_k\] where the constant \(\gamma\) is the discount rate, which takes a value between 0 and 1 and determines how important the future cumulative reward is. Thus the value of an action given a state, under a policy, is \[q_{\pi}(s,a)=\mathbb{E}[G_t \mid S_t=s, A_t=a] \] However, we did not have any policy; informally, a policy is the way an agent must behave or act in a certain state. Despite not being able to determine the value of a state \(v(s)\) directly, we discovered our policy by associating a value with each action in each state, so the greedy action in a state is given by \( \pi(s)=\arg\max_{a}q_{\pi}(s,a)\). Thus we started with an arbitrary policy \(\pi_0\), evaluated it by selecting greedily at each time step, and then, based on the returns obtained, improved it at the end of the episode to get the next policy, repeating until we reached an optimal policy denoted by \(\pi_*\).
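
A condensed sketch of this episode-based Monte Carlo control loop is shown below. It is only an illustration under our own assumptions: the environment API (env.reset, env.step), the discount value, and the small exploration parameter added here are placeholders, not our actual functions or settings, and returns are averaged on every visit to a state-action pair.

```js
// Hypothetical Monte Carlo control sketch: play episodes, average returns,
// and act greedily with respect to the estimated action values.
const gamma = 0.9;                 // discount rate (assumed value)
const Q = new Map();               // "state|action" -> { sum, count }
const actions = ['a', 'w', 's', 'd', ''];

function qValue(state, action) {
  const e = Q.get(`${state}|${action}`);
  return e ? e.sum / e.count : 0;
}

function greedyAction(state) {
  return actions.reduce((best, a) => (qValue(state, a) > qValue(state, best) ? a : best));
}

function runEpisode(env, epsilon = 0.1) {
  // Generate one episode following a (mostly) greedy policy.
  const trajectory = [];
  let state = env.reset();
  let done = false;
  while (!done) {
    const action = Math.random() < epsilon
      ? actions[Math.floor(Math.random() * actions.length)]
      : greedyAction(state);
    const { nextState, reward, terminal } = env.step(action); // placeholder API
    trajectory.push({ state, action, reward });
    state = nextState;
    done = terminal;
  }

  // Backward pass: accumulate discounted returns G and average them into Q.
  let G = 0;
  for (let t = trajectory.length - 1; t >= 0; t--) {
    const { state: s, action: a, reward: r } = trajectory[t];
    G = gamma * G + r;
    const key = `${s}|${a}`;
    const entry = Q.get(key) || { sum: 0, count: 0 };
    entry.sum += G;
    entry.count += 1;
    Q.set(key, entry);
  }
}
```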

At this point we explain and describe how we structured the states, actions, and rewards implemented in this Snake 3D AI. We initially thought it would be right to set up the state as the snake's position followed by the apple's position, also adding the current direction of the snake. If we analyze the environment, it is easy to see that there are \(4^3\) possible positions for the snake and \(4^3\) possible positions for the apple. Analyzing the possible directions the snake can take from the perspective of its current direction, there are 5. The total number of states the snake could confront is \[|S|=\textrm{Snake Positions}\cdot \textrm{Apple Positions}\cdot \textrm{Possible Directions}\] \[|S|=4^3 \cdot 4^3 \cdot 5=20,480 \] Then we realized that the moment the snake's grid grows, there will be many more possible states, which means it will take more time and computation to reach the optimal policy. Therefore our goal was to find a way to define the snake's states that maintains the Markov property while avoiding this rate of growth. In the end, the best structure we found for the states describes only the direction the agent must go in order to reach the apple, whether the next position is a losing position or not for each possible direction, and the snake's current direction. The number of possible states is then

\[|S|=\textrm{Direction to Apple}\cdot \textrm{Boolean Next Positions}\cdot \textrm{Current Direction}\] \[|S|=3^3 \cdot 2^5 \cdot 6=5,184\]

The set of actions the agent can perform is represented as

\[ A=\{(\textrm{'a',right}),(\textrm{'w',up}), (\textrm{'s',down}),(\textrm{'d',left}),(\textrm{'',current direction})\}\]

As an example of the state structure, consider \[s_i=\textrm{'001bbdbbx'}\] The first three characters, \(\textrm{001}\), say that the snake is at the same \(x\) and \(y\) coordinates as the apple, but that its \(z\) position is before the apple's \(z\); thus the agent must learn that moving in the positive \(z\) direction will reach the apple. The next characters, \(\textrm{bbdbb}\), describe the type of cell the agent would confront if it took each action \(a_i\) respectively, with 'b' meaning a blank cell and 'd' a danger cell. The last character defines the snake's current direction, which in this case is moving in the positive direction along the \(x\) axis. The set of rewards is \[R=\{(\textrm{blank},-0.5),(\textrm{danger},-2),(\textrm{apple},2)\}\] meaning that if the agent moves onto a blank cell it obtains a reward of \(-0.5\), and so on.
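
To make the encoding concrete, here is an illustrative sketch of how such a state string and the reward could be produced. It is not our exact code: the axis codes for the "behind" case, the helper names, and the parameter shapes are assumptions for the sake of the example.

```js
// Hypothetical sketch of the 9-character state encoding described above.
function encodeState(snakeHead, apple, dangerFlags, currentDirection) {
  // Direction to apple per axis: '0' = aligned, '1' = apple is ahead,
  // '2' = apple is behind (the symbol for this last case is an assumption).
  const axisCode = (s, a) => (s === a ? '0' : s < a ? '1' : '2');
  const toApple =
    axisCode(snakeHead.x, apple.x) +
    axisCode(snakeHead.y, apple.y) +
    axisCode(snakeHead.z, apple.z);

  // One character per possible action: 'b' for a blank cell, 'd' for a danger cell.
  const nextCells = dangerFlags.map((isDanger) => (isDanger ? 'd' : 'b')).join('');

  return toApple + nextCells + currentDirection; // e.g. '001bbdbbx'
}

// Reward signal as described above.
function reward(cellType) {
  return { blank: -0.5, danger: -2, apple: 2 }[cellType];
}
```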
