Reinforcement Learning: The Markov Decision Problem for Feature Selection
Reinforcement learning (RL) techniques have proven very efficient for problems such as game solving. The concept of RL is based on the Markov decision process (MDP). The point here is not to define the MDP in depth, but to get a general idea of how it works and how it can be useful for our problem.
The basic idea behind RL is that an agent starts in an unknown environment and has to take actions to complete a task. Depending on its current state and the actions it has selected previously, the agent will be more inclined to choose certain actions. For each new state reached and action performed, the agent receives a reward. These are the main elements we need to define for feature selection purposes:
- What is a state?
- What is an action?
- What are the rewards?
- How do we choose an action?
First, a state is simply a subset of the features present in the data set. For example, if the data set has three features (Age, Gender, Height) plus a label, the possible states are:
() --> Empty set
(Age), (Gender), (Height) --> 1-feature set
(Age, Gender), (Gender, Height), (Age, Height) --> 2-feature set
(Age, Gender, Height) --> All-feature set
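To make the state space concrete, here is a minimal sketch that enumerates every possible state of this three-feature example with itertools (the column names are simply the ones used above):

from itertools import combinations

features = ["Age", "Gender", "Height"]  # illustrative feature names from the example above

# A state is an unordered subset of features, so we enumerate the subsets of every size,
# from the empty set up to the set containing all the features.
states = [frozenset(subset)
          for size in range(len(features) + 1)
          for subset in combinations(features, size)]

print(len(states))  # 2 ** 3 = 8 possible states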
Within a state, the order of the features does not matter; the reason will be explained a little later in the article. A state must be treated as a set of features, not as a list.
As for actions, from one subset we can move to any subset obtained by adding exactly one feature that the current state does not yet contain. In other words, an action selects a feature not yet present in the current state and adds it to form the next state. Below is a sample of possible actions:
(Age) -> (Age, Gender)
(Gender, Height) -> (Age, Gender, Height)
Here are some examples of impossible actions:
(Age) -> (Age, Gender, Height)
(Age, Gender) -> (Age)
(Gender) -> (Gender, Gender)
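As a small sketch of this transition rule, reusing the features list from the previous snippet, the legal actions from a state can be generated by adding exactly one feature the state does not already contain:

def possible_actions(state, all_features):
    # An action adds exactly one feature that is not already in the current state.
    return [frozenset(state | {feature}) for feature in all_features if feature not in state]

print(possible_actions(frozenset({"Age"}), features))
# e.g. [frozenset({'Age', 'Gender'}), frozenset({'Age', 'Height'})]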
We have defined the states and the actions, but not yet the reward. The reward is a real number used to evaluate the quality of a state. For example, if a robot trying to reach the exit of a maze moves towards the exit as its next action, the associated reward will be high; if it walks into a trap instead, the reward will be low. The reward is a value that provides feedback on the action that was just performed.
In the feature selection problem, a natural reward is the accuracy that a new feature adds to the model. Below is an example of how the reward is calculated, followed by a small code sketch:
(Age) --> Accuracy = 0.65
(Age, Gender) --> Accuracy = 0.76
Reward(Gender) = 0.76 - 0.65 = 0.11
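A minimal sketch of this reward computation, assuming a scikit-learn style classifier and cross-validated accuracy (the classifier choice and the data are placeholders, not necessarily the article's exact setup):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def state_accuracy(X, y, state):
    # Train a classifier restricted to the columns in `state` and return its mean accuracy.
    model = RandomForestClassifier(random_state=0)
    return cross_val_score(model, X[list(state)], y, cv=5).mean()

# Reward of the action that adds "Gender" to the state (Age),
# assuming X is a pandas DataFrame of features and y the label column:
# reward = state_accuracy(X, y, {"Age", "Gender"}) - state_accuracy(X, y, {"Age"})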
For each state visited for the first time, a classifier is trained on that set of features. The resulting accuracy is stored in the state, so the training of the classifier, which is very expensive, is performed only once, even if the state is reached again later. Since the classifier does not consider the order of the features, the problem can be seen as a graph rather than a tree. In this example, the reward for the action of selecting Gender as a new feature for the model is the difference between the accuracy of the next state and that of the current one.
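Because the order of the features does not matter, a visited state can be cached under its frozenset so that the expensive training step runs at most once per subset. A possible sketch, reusing the state_accuracy helper above:

accuracy_cache = {}  # maps a frozenset of features to the accuracy obtained on the first visit

def cached_accuracy(X, y, state):
    key = frozenset(state)
    if key not in accuracy_cache:
        # Expensive step: executed only the first time this subset is reached.
        accuracy_cache[key] = state_accuracy(X, y, key)
    return accuracy_cache[key]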
In the graph above, each feature has been assigned a number (i.e. “Age” is 1, “Gender” is 2 and “Height” is 3). It is entirely possible to maximize another metric when searching for the optimal set; in many business applications, for example, recall is more important than precision.
The next important question is how we select the next state from the current state, or in other words, how we explore our environment. We have to find an efficient way to do this, as the problem can quickly become very complex. If we naively explored every possible subset of features in a problem with 10 features, the number of states would be
2^10 = 1,024 possible states
This count includes the empty state and the state that contains all the features, and it doubles with every additional feature. Exploring exhaustively, we would have to train one model per state to find the set of features that maximizes accuracy. With the RL approach, we do not have to visit every state, and we never retrain a model for a state that has already been visited.
We also had to define stopping conditions for this problem; they are detailed later. For selecting the next state, an epsilon-greedy strategy was chosen. The idea is that, from the current state, we select the next action at random with probability epsilon (a value between 0 and 1, often around 0.2), and otherwise we select the action that maximizes a given function. For feature selection, that function is the average reward each feature has contributed to the accuracy of the model.
The epsilon-greedy algorithm involves two steps:
- A random phase: with probability epsilon, we select the next state at random among the possible neighbors of the current state (the selection can be uniform or softmax, for example)
- A greedy phase: we select the next state so that the feature added to the current state brings the maximum contribution to the model's accuracy. To reduce time complexity, a list containing this value for each feature is maintained and updated each time a feature is chosen. The update is very cheap thanks to a running-average formula over the following terms (see the sketch after this list):
- AORf : average reward provided by feature “f”
- k : number of times “f” has been selected
- V(F) : value of the state corresponding to feature set F (not detailed in this article for clarity)
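From the three terms above, the update can be reconstructed as the standard incremental mean, AORf = ((k - 1) * AORf + V(F)) / k, which refreshes a feature's running average reward in constant time whenever that feature is selected. Below is a sketch of the epsilon-greedy step built on this running average; the data structures and names are illustrative, not the exact implementation detailed later:

import random

epsilon = 0.2
aor = {feature: 0.0 for feature in features}   # AOR_f: running average reward of each feature
counts = {feature: 0 for feature in features}  # k: number of times each feature has been selected

def update_aor(feature, state_value):
    # Incremental mean: AOR_f = ((k - 1) * AOR_f + V(F)) / k
    counts[feature] += 1
    k = counts[feature]
    aor[feature] = ((k - 1) * aor[feature] + state_value) / k

def next_state(state, all_features):
    candidates = [feature for feature in all_features if feature not in state]
    if random.random() < epsilon:
        chosen = random.choice(candidates)              # random phase
    else:
        chosen = max(candidates, key=lambda f: aor[f])  # greedy phase
    return frozenset(state | {chosen})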
The overall idea is to find which feature has contributed the most accuracy to the model. That is why we need to explore many different states: evaluating a feature within many different subsets gives a more reliable estimate of its overall value for the model.
Finally, I will detail the two stopping conditions. Since the goal is to minimize the number of states the algorithm visits, we must choose them carefully: the fewer previously unvisited states we reach, the fewer models we have to train on different feature sets. Training a model to obtain its accuracy is the most expensive phase in terms of time and computing power.
- In any case, the algorithm stops at the final state, which is the set containing all the features. We want to avoid reaching this state, since it is the most expensive one to train a model on.
- Additionally, it stops navigating the graph if the values of a sequence of visited states degrade successively. A threshold has been set: after a number of successive degradations equal to the square root of the number of features in the data set, the traversal stops.
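As a rough sketch, both stopping conditions could be checked inside the traversal loop; the square-root threshold follows the rule described above, and the names are illustrative:

import math

def should_stop(state, all_features, consecutive_degradations):
    # Condition 1: the state containing every feature is always terminal.
    if len(state) == len(all_features):
        return True
    # Condition 2: stop after sqrt(number of features) successive degradations of the state value.
    return consecutive_degradations >= math.sqrt(len(all_features))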
Now that the modeling of the problem has been explained, we will detail the implementation in Python.