Reinforcement Learning: The Markov Decision Problem for Feature Selection
Reinforcement learning (RL) techniques have proven very efficient for problems such as game solving. The concept of RL is based on the Markov decision process (MDP). The point here is not to define the MDP in depth, but to get a general idea of how it works and how it can be useful for our problem.
The basic idea behind RL is that an agent starts in an unknown environment and has to take actions to complete a task. Depending on its current state and the actions it has selected previously, the agent will be more inclined to choose certain actions. For each new state reached and action performed, the agent receives a reward. These are the main elements we need to define for feature selection purposes:
- What is a state?
- What is an action?
- What are the rewards?
- How do we choose an action?
First, a state is simply a subset of the features present in the data set. For example, if the data set has three features (Age, Gender, Height) plus a label, the possible states are:
() --> Empty set
(Age), (Gender), (Height) --> 1-feature set
(Age, Gender), (Gender, Height), (Age, Height) --> 2-feature set
(Age, Gender, Height) --> All-feature set
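To make the state space concrete, here is a minimal sketch that enumerates every possible state of this three-feature example with itertools (the column names are simply the ones used above):

from itertools import combinations

features = ["Age", "Gender", "Height"]  # illustrative feature names from the example above

# A state is an unordered subset of features, so we enumerate the subsets of every size,
# from the empty set up to the set containing all the features.
states = [frozenset(subset)
          for size in range(len(features) + 1)
          for subset in combinations(features, size)]

print(len(states))  # 2 ** 3 = 8 possible states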
Within a state, the order of the features does not matter; the reason will be explained a little later in the article. A state must be treated as a set of features, not as a list.
As for actions, from one subset we can move to any subset obtained by adding exactly one feature that the current state does not yet contain. In other words, an action selects a feature not yet present in the current state and adds it to form the next state. Below is a sample of possible actions:
(Age) -> (Age, Gender)
(Gender, Height) -> (Age, Gender, Height)
Here are some examples of impossible actions:
(Age) -> (Age, Gender, Height)
(Age, Gender) -> (Age)
(Gender) -> (Gender, Gender)
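As a small sketch of this transition rule, reusing the features list from the previous snippet, the legal actions from a state can be generated by adding exactly one feature the state does not already contain:

def possible_actions(state, all_features):
    # An action adds exactly one feature that is not already in the current state.
    return [frozenset(state | {feature}) for feature in all_features if feature not in state]

print(possible_actions(frozenset({"Age"}), features))
# e.g. [frozenset({'Age', 'Gender'}), frozenset({'Age', 'Height'})]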
We have defined the states and the actions, but not yet the reward. The reward is a real number used to evaluate the quality of a state. For example, if a robot trying to reach the exit of a maze moves towards the exit as its next action, the associated reward will be high; if it walks into a trap instead, the reward will be low. The reward is a value that provides feedback on the action that was just performed.
In the feature selection problem, a natural reward is the accuracy that a new feature adds to the model. Below is an example of how the reward is calculated, followed by a small code sketch:
(Age) --> Accuracy = 0.65
(Age, Gender) --> Accuracy = 0.76
Reward(Gender) = 0.76 - 0.65 = 0.11
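A minimal sketch of this reward computation, assuming a scikit-learn style classifier and cross-validated accuracy (the classifier choice and the data are placeholders, not necessarily the article's exact setup):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def state_accuracy(X, y, state):
    # Train a classifier restricted to the columns in `state` and return its mean accuracy.
    model = RandomForestClassifier(random_state=0)
    return cross_val_score(model, X[list(state)], y, cv=5).mean()

# Reward of the action that adds "Gender" to the state (Age),
# assuming X is a pandas DataFrame of features and y the label column:
# reward = state_accuracy(X, y, {"Age", "Gender"}) - state_accuracy(X, y, {"Age"})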
For each state visited for the first time, a classifier is trained on that set of features. The resulting accuracy is stored in the state, so the training of the classifier, which is very expensive, is performed only once, even if the state is reached again later. Since the classifier does not consider the order of the features, the problem can be seen as a graph rather than a tree. In this example, the reward for the action of selecting Gender as a new feature for the model is the difference between the accuracy of the next state and that of the current one.
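Because the order of the features does not matter, a visited state can be cached under its frozenset so that the expensive training step runs at most once per subset. A possible sketch, reusing the state_accuracy helper above:

accuracy_cache = {}  # maps a frozenset of features to the accuracy obtained on the first visit

def cached_accuracy(X, y, state):
    key = frozenset(state)
    if key not in accuracy_cache:
        # Expensive step: executed only the first time this subset is reached.
        accuracy_cache[key] = state_accuracy(X, y, key)
    return accuracy_cache[key]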
In the graph above, each feature has been assigned a number (i.e. “Age” is 1, “Gender” is 2 and “Height” is 3). It is entirely possible to maximize another metric when searching for the optimal set; in many business applications, for example, recall is more important than precision.
The next important question is how we select the next state from the current state, or in other words, how we explore our environment. We have to find an efficient way to do this, as the problem can quickly become very complex. If we naively explored every possible subset of features in a problem with 10 features, the number of states would be
2^10 = 1,024 possible states
This count includes the empty state and the state that contains all the features, and it doubles with every additional feature. Exploring exhaustively, we would have to train one model per state to find the set of features that maximizes accuracy. With the RL approach, we do not have to visit every state, and we never retrain a model for a state that has already been visited.
We also had to define stopping conditions for this problem; they are detailed later. For selecting the next state, an epsilon-greedy strategy was chosen. The idea is that, from the current state, we select the next action at random with probability epsilon (a value between 0 and 1, often around 0.2), and otherwise we select the action that maximizes a given function. For feature selection, that function is the average reward each feature has contributed to the accuracy of the model.
The epsilon-greedy algorithm involves two steps:
- A random phase: with probability epsilon, we select the next state at random among the possible neighbors of the current state (the selection can be uniform or softmax, for example)
- A greedy phase: we select the next state so that the feature added to the current state brings the maximum contribution to the model's accuracy. To reduce time complexity, a list containing this value for each feature is maintained and updated each time a feature is chosen. The update is very cheap thanks to a running-average formula over the following terms (see the sketch after this list):
- AORf : average reward provided by feature “f”
- k : number of times “f” has been selected
- V(F) : value of the state corresponding to feature set F (not detailed in this article for clarity)
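From the three terms above, the update can be reconstructed as the standard incremental mean, AORf = ((k - 1) * AORf + V(F)) / k, which refreshes a feature's running average reward in constant time whenever that feature is selected. Below is a sketch of the epsilon-greedy step built on this running average; the data structures and names are illustrative, not the exact implementation detailed later:

import random

epsilon = 0.2
aor = {feature: 0.0 for feature in features}   # AOR_f: running average reward of each feature
counts = {feature: 0 for feature in features}  # k: number of times each feature has been selected

def update_aor(feature, state_value):
    # Incremental mean: AOR_f = ((k - 1) * AOR_f + V(F)) / k
    counts[feature] += 1
    k = counts[feature]
    aor[feature] = ((k - 1) * aor[feature] + state_value) / k

def next_state(state, all_features):
    candidates = [feature for feature in all_features if feature not in state]
    if random.random() < epsilon:
        chosen = random.choice(candidates)              # random phase
    else:
        chosen = max(candidates, key=lambda f: aor[f])  # greedy phase
    return frozenset(state | {chosen})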
The overall idea is to find which feature has contributed the most accuracy to the model. That is why we need to explore many different states: evaluating a feature within many different subsets gives a more reliable estimate of its overall value for the model.
Finally, I will detail the two stopping conditions. Since the goal is to minimize the number of states the algorithm visits, we must choose them carefully: the fewer previously unvisited states we reach, the fewer models we have to train on different feature sets. Training a model to obtain its accuracy is the most expensive phase in terms of time and computing power.
- In any case, the algorithm stops at the final state, which is the set containing all the features. We want to avoid reaching this state, since it is the most expensive one to train a model on.
- Additionally, it stops navigating the graph if the values of a sequence of visited states degrade successively. A threshold has been set: after a number of successive degradations equal to the square root of the number of features in the data set, the traversal stops.
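As a rough sketch, both stopping conditions could be checked inside the traversal loop; the square-root threshold follows the rule described above, and the names are illustrative:

import math

def should_stop(state, all_features, consecutive_degradations):
    # Condition 1: the state containing every feature is always terminal.
    if len(state) == len(all_features):
        return True
    # Condition 2: stop after sqrt(number of features) successive degradations of the state value.
    return consecutive_degradations >= math.sqrt(len(all_features))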
Now that the modeling of the problem has been explained, we will detail the implementation in Python.