Discover how a neural network with one hidden layer using ReLU activation can represent (or closely approximate) any continuous nonlinear function.
Activation functions play an integral role in neural networks (NNs) since they introduce nonlinearity and allow the network to learn more complex features and functions than a simple linear regression. One of the most widely used activation functions is the Rectified Linear Unit (ReLU), which has been shown theoretically to allow NNs to approximate a wide range of continuous functions, making them powerful function approximators.
In this post, we study in particular continuous nonlinear (CNL) functions, since approximating them is the main reason to use an NN over a simple linear regression model. More precisely, we investigate two subcategories of CNL functions: Continuous PieceWise Linear (CPWL) functions and Continuous Curve (CC) functions. We will show how these two types of functions can be represented by an NN with a single hidden layer, provided it has enough neurons with ReLU activation.
For illustrative purposes, we consider only single-feature inputs, but the idea also applies to multiple-feature inputs.
ReLU is a piecewise linear function consisting of two linear parts: one that cuts off negative values, mapping them to zero, and one that passes non-negative values through unchanged.
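As a minimal illustration (plain NumPy, not tied to any particular deep learning library), ReLU can be written as:

```python
import numpy as np

def relu(z):
    """ReLU: 0 for negative inputs, identity for non-negative inputs."""
    return np.maximum(0.0, z)

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
```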
CPWL functions are continuous functions made up of multiple linear portions. The slope is constant within each piece but changes abruptly at the transition points, where a new linear function takes over.
In an NN with one hidden layer using ReLU activation and a linear output layer, the activations are summed to form the target CPWL function. Each hidden unit is responsible for one linear piece. At each unit, a new ReLU function corresponding to the slope change is added to produce the new slope (see figure 2). Since this activation function is always non-negative, the output-layer weights of the units that increase the slope are positive and, conversely, the weights of the units that decrease the slope are negative (see figure 3). The new function is added at the transition point but does not contribute to the resulting function before (and sometimes after) that point, thanks to the deactivation range of the ReLU activation function.
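To see this non-interference in action, here is a small sketch (the slopes and the transition point at x = 1 are illustrative choices, not values taken from the figures): a unit with a negative output weight decreases the slope, and it contributes nothing before its transition point.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.linspace(-1.0, 3.0, 5)      # [-1, 0, 1, 2, 3]

base = 1.0 * x                      # current function: slope +1
bump = relu(x - 1.0)                # new unit: switches on at x = 1
total = base + (-2.0) * bump        # negative output weight: slope drops from +1 to -1

print(bump)   # [0. 0. 0. 1. 2.] -> no contribution before the transition point
print(total)  # [-1. 0. 1. 0. -1.] -> slope +1 before x = 1, slope -1 after
```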
Example
To make this more concrete, we consider an example of a CPWL function consisting of 4 linear segments, defined below.
To represent this target function, we will use an NN with 1 hidden layer of 4 units and a linear output layer that computes the weighted sum of the activations of the previous layer. Let's determine the network parameters so that each unit in the hidden layer represents one segment of the target. For the sake of this example, the output layer bias (b2_0) is set to 0.
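Since the segment definition comes from a figure, the sketch below plugs in illustrative transition points and slopes of our own; the wiring, however, follows the construction described above: one hidden ReLU unit per segment, output weights equal to the slope changes, and b2_0 = 0.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical target: transition points at x = 0, 1, 2, 3 with segment
# slopes 1, -1, 2, 0 (illustrative numbers, not the ones from the figures).
transitions = np.array([0.0, 1.0, 2.0, 3.0])
slopes = np.array([1.0, -1.0, 2.0, 0.0])

# Hidden layer: unit i computes relu(1 * x + b1_i) with b1_i = -transition_i,
# so it only switches on at its transition point.
W1 = np.ones(4)
b1 = -transitions

# Output layer: each weight is the slope *change* its unit introduces
# (positive to increase the slope, negative to decrease it); b2_0 = 0.
W2 = np.concatenate(([slopes[0]], np.diff(slopes)))
b2_0 = 0.0

def network(x):
    h = relu(np.outer(x, W1) + b1)  # hidden activations, shape (len(x), 4)
    return h @ W2 + b2_0            # linear output layer -> CPWL function

print(network(np.linspace(0.0, 4.0, 9)))
```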
The next type of continuous nonlinear function that we will study is the CC function. There is no standard definition for this subcategory, but an informal way to define CC functions is as continuous nonlinear functions that are nonlinear on every piece, i.e., not made of straight segments. Examples of CC functions include the quadratic function, the exponential function, the sine function, etc.
A CC function can be approximated by a series of small linear pieces, which is called a piecewise linear approximation of the function. The greater the number of linear pieces and the smaller each segment, the better the approximation of the target function. Therefore, the same network architecture as above, with a sufficiently large number of hidden units, can produce a good approximation of a curve function.
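As a rough sketch of this idea, the snippet below reuses the same single-hidden-layer construction to build the piecewise linear interpolant of sin on [0, 2π], with equally spaced transition points (an illustrative choice), and prints how the maximum error shrinks as the number of hidden units grows:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_approx(f, x, n_units, lo, hi):
    """Piecewise linear interpolant of f on [lo, hi], built as a single
    hidden layer of `n_units` ReLU units plus a linear output layer."""
    knots = np.linspace(lo, hi, n_units + 1)
    piece_slopes = np.diff(f(knots)) / np.diff(knots)                 # slope of each linear piece
    w2 = np.concatenate(([piece_slopes[0]], np.diff(piece_slopes)))   # output weights = slope changes
    h = relu(x[:, None] - knots[:-1])                                 # hidden activations
    return h @ w2 + f(knots[0])                                       # output bias = f(lo)

x = np.linspace(0.0, 2 * np.pi, 200)
for n in (4, 16, 64):
    err = np.max(np.abs(relu_approx(np.sin, x, n, 0.0, 2 * np.pi) - np.sin(x)))
    print(f"{n:3d} hidden units -> max error {err:.4f}")
```

More hidden units mean more (and shorter) linear pieces, so the printed error drops as the unit count increases.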
However, in practice, the network is trained to fit a given data set for which the input-output mapping function is unknown. An architecture with too many neurons is prone to overfitting and high variance, and takes longer to train. Therefore, an appropriate number of hidden units should be large enough to fit the data adequately and, at the same time, small enough to avoid overfitting. Furthermore, with a limited number of neurons, a good low-loss approximation places more transition points in some parts of the domain than in others, rather than spacing them equidistantly as in uniform sampling (as shown in figure 10).
In this post, we have studied how the ReLU activation function allows multiple units to contribute to the resulting function without interfering with one another, enabling the approximation of continuous nonlinear functions. Furthermore, we have discussed the choice of network architecture and the number of hidden units needed to obtain a good approximation.
I hope this post is helpful for your Machine Learning learning process!
More questions to think about:
- How does the approximation ability change if the number of hidden layers with ReLU activation increases?
- How are ReLU activations used for a classification problem?
*Unless otherwise noted, all images are the author's.