Trying to Implement Deep Q-Learning for a line-follower robot in NXC (using LEGO MINDSTORMS NXT)

Disclaimer

This sample code did not learn the correct actions on my robot, because it only has a single light sensor. If you can use multiple light sensors instead of one, this code should still be useful as a starting point for reinforcement learning.

Besides, I haven't written the neural network part yet :) :) :) :) :) :) :) :) :) :) :)

What is Deep Q-Learning?

Deep Q-Learning is a well-known reinforcement learning algorithm that uses a neural network to estimate the relationship between the states of an agent and the values of its actions (the Q-values).
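As background (standard reinforcement learning, not specific to this robot): the Q-value Q(s, a) estimates how much total discounted reward the agent can expect if it takes action a in state s and keeps acting well afterwards. The optimal Q-function satisfies the Bellman equation:

  Q(s, a) = E[ reward + gamma * max_a' Q(s', a') ]

where s' is the state reached after taking a, and gamma (between 0 and 1) discounts future rewards. "Deep" Q-Learning approximates this function with a neural network instead of a lookup table.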

The main function

Here is the main function (below). Basically, this code does the following:

1. get the initial state from the light sensor
2. select an action based on the Q-values for that state
3. do the action
4. wait a short time while the robot keeps moving (Wait(100), i.e. 0.1 s, in the code below)
5. get the next state from the light sensor
6. get a reward for the state the agent moved to
7. update the Q-value based on the state, the reward and the action (this is where a neural network would come in)
8. go back to step 2 and repeat

This loop also summarizes how reinforcement learning works in general. Although the algorithm is called "Deep" Q-Learning, the neural network is only involved in determining the Q-values, which store the relationship between the agent's states and its actions.
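Concretely, step 7 is the standard tabular Q-learning update, and it is exactly the formula that appears in the code:

  Q(s, a) <- (1 - alpha) * Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a'))

Here s and a are the current state and action, s' is the next state, alpha is the learning rate (0.7 in this code) and gamma is the discount factor (0.9). In the "deep" version, the right-hand side would become the training target for the neural network's output Q(s, a).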


task main() {

  // initialize the Q-table with zeros
  for (i = 0; i < numberofState; i++) {
    for (j = 0; j < numberOfAction; j++) {
      Qtable[i][j] = 0;
    }
  }

  state = getState(threshold); // get initial state

  int rewardBefore = 0;
  while (1) {
    action = epsilonGreedy(epsilon, state, numberOfAction); // epsilon-greedy action selection

    doAction(action); // drive the motors

    stateNext = getState(threshold);              // observe the next state
    reward = getReward(stateNext);                // reward for the state we moved to
    maxQ = getMaxQval(stateNext, numberOfAction); // best Q-value available from the next state
    Qtable[state][action] = (1 - alpha) * Qtable[state][action] + alpha * (reward + gamma * maxQ);

    if (reward < 0)
    {
      // off the line: stop both motors and pause as a penalty
      Off(OUT_BC);
      Wait(2000);
    }
    else
    {
      // on the line: short beep, then a brief stop before the next step
      PlaySound(SOUND_CLICK);
      Wait(100);
      Off(OUT_BC);
      Wait(100);
    }
    rewardBefore = reward;
    state = stateNext;
  }
}

Implementing a Reinforcement Learning part

In this line-following robot exapmle, some states and actions are needed to implement reinforcement learning. This robot has a few states,
- state 1: on the right side of the black line
- state 2: on the left side of the black line
- state 3: on the black line
and a few actions.
- action 1: moving forward
- action 2: turn left
- action 3: turn right
It's an easy problem, isnt'it? I added few more actions and states to move the robot smoothly.

getState()

First things first: we have to know where the robot is in the environment. To do this, we read the light sensor value (SENSOR_3). By comparing this value against a threshold, we can estimate whether the robot is on the black line or not.

int getState(int threshold) {
  int state;
  SetSensorLight(S3);

  // Map the raw light reading into 6 bands around the threshold:
  // 0 = darkest (well inside the black line), 5 = brightest (well off the line)
  if (SENSOR_3 < threshold - 10)
  {
    state = 0;
  }
  else if (SENSOR_3 < threshold - 5)
  {
    state = 1;
  }
  else if (SENSOR_3 < threshold)
  {
    state = 2;
  }
  else if (SENSOR_3 < threshold + 5)
  {
    state = 3;
  }
  else if (SENSOR_3 < threshold + 10)
  {
    state = 4;
  }
  else
  {
    state = 5;
  }
  ClearScreen();
  TextOut(0, LCD_LINE1, "Light:");   // label on LCD line 1
  NumOut(80, LCD_LINE1, Sensor(S3)); // show the light sensor reading
  TextOut(0, LCD_LINE2, "State:");   // label on LCD line 2
  NumOut(80, LCD_LINE2, state);      // show the current state

  return state;

}
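By the way, the threshold value (threshold = 60 in this code) depends on the lighting and on your particular sensor. If you prefer to measure it instead of hard-coding it, a small calibration helper could look like the sketch below. This is a hypothetical addition, not part of the original program; it assumes the light sensor is on port S3 (as in getState()) and uses the orange Enter button.

// Hypothetical helper: estimate the threshold by averaging one reading taken
// over the white floor and one taken over the black line.
int calibrateThreshold() {
  int whiteVal, blackVal;
  SetSensorLight(S3);

  ClearScreen();
  TextOut(0, LCD_LINE1, "White, then BTN");
  until (ButtonPressed(BTNCENTER, false));   // sensor over white floor, press Enter
  whiteVal = Sensor(S3);
  until (!ButtonPressed(BTNCENTER, false));  // wait until the button is released

  TextOut(0, LCD_LINE2, "Black, then BTN");
  until (ButtonPressed(BTNCENTER, false));   // sensor over the black line, press Enter
  blackVal = Sensor(S3);

  return (whiteVal + blackVal) / 2;          // midpoint of the two readings
}

You could then call threshold = calibrateThreshold(); at the top of task main(), before the learning loop.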

doAction()

To turn the chosen action into motion, doAction() maps the action index to motor commands. The action itself comes from a function that relates states to actions; in Deep Q-Learning that function is a neural network whose output is a vector of Q-values, and the action we want is the index of the largest element of that vector (argmax). In the tabular version used here, Qtable plays the role of that function.


void onRL(float speed_l, float speed_r)
{
  // OUT_B drives the right wheel and OUT_C the left wheel (per this naming)
  OnFwd(OUT_B, speed_r);
  OnFwd(OUT_C, speed_l);
}

void doAction(int action) {


  // actionToSpeedArray = [][];
  if (action == 0)
  {
    onRL(25, 40);  // gentle curve: left wheel slower
  }
  else if (action == 1)
  {
    onRL(40, 25);  // gentle curve: right wheel slower
  }
  else if (action == 2)
  {
    onRL(30, 0);   // sharp turn: right wheel stopped
  }
  else if (action == 3)
  {
    onRL(0, 30);   // sharp turn: left wheel stopped
  }
}
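The commented-out actionToSpeedArray line above hints at a table-driven version of doAction(). A minimal sketch of that idea (hypothetical, but equivalent to the if/else chain) could be:

// Hypothetical table-driven variant of doAction(): one speed table per wheel,
// indexed by the action number (0..3), with the same values as above.
int speedLeft[]  = {25, 40, 30,  0};
int speedRight[] = {40, 25,  0, 30};

void doActionTable(int action) {
  onRL(speedLeft[action], speedRight[action]);
}

This makes it easy to add or tune actions by editing the tables only.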

selectAction()

How do we make the robot learn new relationships between actions and states? We add randomness to the action selection: with probability epsilon we choose a random action, and with probability (1 - epsilon) we choose the greedy action returned by selectAction() below. This method is called epsilon-greedy (with epsilon = 50 out of 100 in this code, the robot explores on roughly half of its steps). First, selectAction() simply returns the action with the largest Q-value for the given state:


int selectAction(int state, int num) {
  double maxQval;
  int i = 0;
  int maxIndex;
  int action;

  maxQval = Qtable[state][0];
  maxIndex = 0;

  for (i = 1; i < num; i++) {
    if (Qtable[state][i] > maxQval) {
      maxQval = Qtable[state][i];
      maxIndex = i;
    }
  }

  action = maxIndex;
  return action;
}

epsilonGreedy()

So now we can get an action for a given state (eventually this will be done by the neural network). But we know that a neural network has to update its parameters by quantifying the error between its output and labelled data. How do we get the labelled data?

Here we use the updated Q-value as the labelled data, computed with the Q-Learning algorithm and the rewards. In each turn we update the Q-value (the output of the neural network) to create the labelled Q-value; the error is (Q-value - updated Q-value). To calculate the labelled Q-value, we need rewards.
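Written out (this is the standard DQN-style formulation; remember that the network itself is not implemented in this post):

  target = reward + gamma * max_a' Q(stateNext, a')   // the "labelled" Q-value
  error  = Q(state, action) - target

The network weights would then be adjusted to shrink error^2, for example by gradient descent. The tabular update in task main() is the same idea, with Qtable taking the place of the network.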


int epsilonGreedy(int eps, int state, int num) {
  int action;
  if (eps > rand() % 100) {
    action = rand() % num;             // explore: random action with probability eps/100
  } else {
    action = selectAction(state, num); // exploit: greedy action
  }
  return action;
}

getReward()

Rewards are used to update the Q-values, and how you set them really matters for making reinforcement learning work. In this code the robot gets a positive reward that grows while it stays on the black line (100 times the number of consecutive on-line steps) and a large penalty of -1000 as soon as it leaves the line.

double getReward(int state) {
  double reward;

  reward = 0;
  if (SENSOR_3 < threshold)
  {
    rewardRecord += 1;
    reward = 100 * rewardRecord;
  }
  else
  {
    rewardRecord = 1;
    reward = -1000;
  }

  return reward;
}
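For example, starting from the initial rewardRecord = 1, three consecutive steps on the line followed by one step off the line give:

  on the line : rewardRecord = 2, reward = 200
  on the line : rewardRecord = 3, reward = 300
  on the line : rewardRecord = 4, reward = 400
  off the line: rewardRecord reset to 1, reward = -1000

So staying on the line becomes more and more valuable, while leaving it is punished heavily.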

The entire code

Here is the entire code, written in NXC (a C-like language for the LEGO MINDSTORMS NXT).

int numberOfAction = 4;
int numberofState = 6;
double Qtable[numberofState][numberOfAction];
double maxQ = 0;
double reward = 0;
double alpha = 0.7;    // learning rate
double gamma = 0.9;    // discount factor
int epsilon = 50;      // exploration probability, out of 100

int action = 0;
int state = 0;
int stateNext = 0;
int i, j;
int seed = 0;
int threshold = 60;    // light reading that separates "on the line" from "off the line"
int actionRecord[];
int rewardRecord = 1;  // counts consecutive steps on the line (see getReward)

int getState(int threshold) {
  int state;
  SetSensorLight(S3);

  // Map the raw light reading into 6 bands around the threshold:
  // 0 = darkest (well inside the black line), 5 = brightest (well off the line)
  if (SENSOR_3 < threshold - 10)
  {
    state = 0;
  }
  else if (SENSOR_3 < threshold - 5)
  {
    state = 1;
  }
  else if (SENSOR_3 < threshold)
  {
    state = 2;
  }
  else if (SENSOR_3 < threshold + 5)
  {
    state = 3;
  }
  else if (SENSOR_3 < threshold + 10)
  {
    state = 4;
  }
  else
  {
    state = 5;
  }
  ClearScreen();
  TextOut(0, LCD_LINE1, "Light:");   // label on LCD line 1
  NumOut(80, LCD_LINE1, Sensor(S3)); // show the light sensor reading
  TextOut(0, LCD_LINE2, "State:");   // label on LCD line 2
  NumOut(80, LCD_LINE2, state);      // show the current state

  return state;

}

double getReward(int state) {
  double reward;

  reward = 0;
  if (SENSOR_3 < threshold)
  {
    rewardRecord += 1;
    reward = 100 * rewardRecord;
  }
  else
  {
    rewardRecord = 1;
    reward = -1000;
  }

  return reward;
}

void onRL(float speed_l, float speed_r)
{
  // OUT_B drives the right wheel and OUT_C the left wheel (per this naming)
  OnFwd(OUT_B, speed_r);
  OnFwd(OUT_C, speed_l);
}

void doAction(int action) {


  // actionToSpeedArray = [][];
  if (action == 0)
  {
    onRL(25, 40);  // gentle curve: left wheel slower
  }
  else if (action == 1)
  {
    onRL(40, 25);  // gentle curve: right wheel slower
  }
  else if (action == 2)
  {
    onRL(30, 0);   // sharp turn: right wheel stopped
  }
  else if (action == 3)
  {
    onRL(0, 30);   // sharp turn: left wheel stopped
  }
}

double getMaxQval(int state, int num) {
  double maxQval;
  int i = 0;

  maxQval = Qtable[state][0];

  for (i = 1; i < num; i++) {
    if (Qtable[state][i] > maxQval) {
      maxQval = Qtable[state][i];
    }
  }

  return maxQval;
}

int selectAction(int state, int num) {
  double maxQval;
  int i = 0;
  int maxIndex;
  int action;

  maxQval = Qtable[state][0];
  maxIndex = 0;

  for (i = 1; i < num; i++) {
    if (Qtable[state][i] > maxQval) {
      maxQval = Qtable[state][i];
      maxIndex = i;
    }
  }

  action = maxIndex;
  return action;
}

int epsilonGreedy(int eps, int state, int num) {
  int action;
  if (eps > rand() % 100) {
    action = rand() % num;             // explore: random action with probability eps/100
  } else {
    action = selectAction(state, num); // exploit: greedy action
  }
  return action;
}


task main() {

  // initialize the Q-table with zeros
  for (i = 0; i < numberofState; i++) {
    for (j = 0; j < numberOfAction; j++) {
      Qtable[i][j] = 0;
    }
  }

  state = getState(threshold); // get initial state

  int rewardBefore = 0;
  while (1) {
    action = epsilonGreedy(epsilon, state, numberOfAction); // epsilon-greedy action selection

    doAction(action); // drive the motors

    stateNext = getState(threshold);              // observe the next state
    reward = getReward(stateNext);                // reward for the state we moved to
    maxQ = getMaxQval(stateNext, numberOfAction); // best Q-value available from the next state
    Qtable[state][action] = (1 - alpha) * Qtable[state][action] + alpha * (reward + gamma * maxQ);

    if (reward < 0)
    {
      // off the line: stop both motors and pause as a penalty
      Off(OUT_BC);
      Wait(2000);
    }
    else
    {
      // on the line: short beep, then a brief stop before the next step
      PlaySound(SOUND_CLICK);
      Wait(100);
      Off(OUT_BC);
      Wait(100);
    }
    rewardBefore = reward;
    state = stateNext;
  }
}

Implementing a neural network part

Under construction forever

References

NXC Programmer's Guide
Simple Reinforcement Learning: Q-learning by Andre Violante
実装 強化学習: Cによるロボットプログラミング (Implementing Reinforcement Learning: Robot Programming in C)
機械学習スタートアップシリーズ Pythonで学ぶ強化学習 入門から実践まで (Machine Learning Startup Series: Reinforcement Learning with Python, from Introduction to Practice) (KS情報科学専門書)