In recent years, neural networks have demonstrated unprecedented success across various domains, fueled by their remarkable ability to generalize well and achieve impressive results. One notable trend in their development is the constant increase in size, i.e., in the number of parameters. Surprisingly, these networks often have more parameters than training samples. This raises the question: how many parameters are truly necessary for a given task?

Determining the optimal number of parameters is a crucial aspect of designing efficient neural network architectures. While increasing parameter counts have shown promising outcomes, researchers still lack a proper methodology for ascertaining the ideal parameterization for a specific task. A potential approach is to treat the number of parameters required to solve a problem as a rough measure of its complexity, although the accuracy and reliability of such an estimate remain uncertain.

In light of this challenge, [Li18M] put forth a new method for estimating the task difficulty based on optimizing parameters within a random subspace of the full parameter space.

In typical neural network training, the network’s parameters are
randomly initialized and then optimized to minimize some loss.
The authors of the paper call this *direct optimization*.
Instead, the new method optimizes the objective within a random subspace
of the full parameter space anchored at the initial parameters, see Figure 1.
The smallest dimension of this random subspace at which
the optimum reaches some performance threshold
then serves as a measure of the difficulty of the task.

Given a neural network with parameter vector $\theta^{(D)} \in \mathbb{R}^{D}$ and initial parameters $\theta_0^{(D)}$, we define the parameters in the following way:

$$ \theta^{(D)} = \theta_0^{(D)} + P\theta^{(d)} $$

where:

- $P \in \mathbb{R}^{D \times d}$ is a randomly generated projection matrix. (The authors' choice of the term "projection" for a transformation from a lower-dimensional space to a higher-dimensional one is perplexing; "lift" would have been more appropriate.)
- $\theta^{(d)} \in \mathbb{R}^d$ is a parameter vector in a generally smaller subspace.
- $P$ and $\theta_0^{(D)}$ are randomly generated and frozen (not trained), so the system is restricted to only $d$ degrees of freedom.
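To make the reparameterization concrete, here is a minimal NumPy sketch on a toy quadratic objective. The objective, dimensions, learning rate, and the column normalization of $P$ are our own illustrative choices, not the authors' experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

D, d = 1000, 10  # full and subspace dimensions (illustrative values)

# Frozen quantities: random initial parameters and a random projection matrix
# with unit-norm columns (one common normalization choice).
theta_0 = rng.normal(size=D)
P = rng.normal(size=(D, d))
P /= np.linalg.norm(P, axis=0)

# Toy objective: squared distance to a fixed target in the full space.
target = rng.normal(size=D)

def loss(theta_d):
    theta_D = theta_0 + P @ theta_d  # lift to the full parameter space
    return 0.5 * np.sum((theta_D - target) ** 2)

def grad(theta_d):
    # Chain rule: gradients flow only to the d free parameters theta_d,
    # since theta_0 and P are frozen.
    theta_D = theta_0 + P @ theta_d
    return P.T @ (theta_D - target)

# Plain gradient descent on the d degrees of freedom.
theta_d = np.zeros(d)
for _ in range(200):
    theta_d -= 0.1 * grad(theta_d)
```

Note that training starts at $\theta^{(d)} = 0$, i.e., exactly at the random initialization $\theta_0^{(D)}$, and only ever moves within the $d$-dimensional affine subspace.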

The key idea is to start with a low value of $d$ and then
to gradually increase it until we reach a threshold, see Figure 2.
We call this value the **intrinsic dimension** $d_\operatorname{int}$.
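The search for $d_\operatorname{int}$ can be sketched as a simple scan over candidate dimensions. Here `train_in_subspace` is a hypothetical callback, not from the paper's code, that performs the subspace training described above and returns the resulting performance:

```python
def intrinsic_dimension(train_in_subspace, threshold, d_values):
    """Return the smallest d in d_values whose subspace optimum clears threshold.

    train_in_subspace(d) is a hypothetical callback: it should train the model
    with d free parameters (theta^(d) above) and return its final performance.
    """
    for d in d_values:
        if train_in_subspace(d) >= threshold:
            return d
    return None  # threshold not reached within the tested range

# Demo with a fake performance curve that saturates at d = 100:
fake_performance = lambda d: min(1.0, d / 100)
d_int = intrinsic_dimension(fake_performance, threshold=0.9,
                            d_values=range(10, 201, 10))
```

Since each candidate $d$ requires a full training run, the granularity of `d_values` trades off precision of the estimate against compute.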

From the equation we can observe that this method does not take the network's structure into account: it mixes the values of all parameters regardless of the layer to which they belong.

What remains is to select a threshold. The authors base it on the performance of a baseline model, taken as the best "directly trained" model for the given problem. Specifically, they defined $d_{\operatorname{int}90}$ as the intrinsic dimension of the "90%" solution: a solution with performance at least 90% of the baseline. They excluded the "100%" solution, since it was observed to vary strongly and thus be unreliable.

Table 1 shows the measured $d_{\operatorname{int}90}$ on various supervised and reinforcement learning problems on which they experimented.

The intrinsic dimension of the fully connected (FC) network is the same for the standard MNIST dataset and for MNIST with shuffled pixels (Shuf Pixels), due to the permutation invariance of this type of network, whereas it increases substantially for MNIST with shuffled labels (Shuf Labels).

The intrinsic dimension of the convolutional network (LeNet) is lower than that of the FC network on the standard MNIST dataset, whereas it is higher on MNIST with shuffled pixels. This aligns with the fact that convolutional networks depend on the structure of the input data.

For ImageNet there is no actual value for the intrinsic dimension, because memory issues arise when using this approach for large neural networks, such as Large Language Models (LLMs) or, in this specific case, the SqueezeNet model. In these cases we can't use a dense projection matrix, because it wouldn't fit in memory: for the RoBERTa large model with $D = 354 \cdot 10^6$, a subspace dimension $d = 10^3$, and 16-bit precision, we would need approximately 0.7 terabytes of RAM just for the projection matrix. The authors instead used a sparse projection matrix and the Fastfood transform, in which the projection matrix $P$ is redefined as $P = HG\Pi HB$, where $H$ is a Hadamard matrix, $G$ is a random diagonal matrix with independent standard normal entries, $B$ is a random diagonal matrix with equal-probability $\pm 1$ entries, and $\Pi$ is a random permutation matrix; the multiplication with a Hadamard matrix can be computed in $O(D \log(d))$ via the Fast Walsh-Hadamard Transform [Le13F]. Even so, they still didn't have enough time to finish this specific experiment.
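To illustrate why this is memory-efficient, here is a NumPy sketch of a matrix-free lift in the spirit of the Fastfood construction. It ignores the normalization constants of the full Fastfood transform, assumes $D$ is a power of two, and all names are our own:

```python
import numpy as np

def fwht(x):
    """Unnormalized Fast Walsh-Hadamard Transform; len(x) must be a power of 2."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
D = 8  # full dimension; a power of two for simplicity (pad otherwise)

# The diagonal and permutation factors need only O(D) memory,
# versus O(D * d) for a dense projection matrix.
B = rng.choice([-1.0, 1.0], size=D)  # random +-1 diagonal
G = rng.normal(size=D)               # random standard normal diagonal
perm = rng.permutation(D)            # random permutation Pi

def fastfood_lift(theta_d):
    """Apply P = H G Pi H B to theta_d (zero-padded to length D) matrix-free."""
    v = np.zeros(D)
    v[:len(theta_d)] = theta_d
    v = fwht(B * v)   # H B theta
    v = G * v[perm]   # G Pi (H B theta)
    return fwht(v)    # H G Pi H B theta
```

The lift is linear in $\theta^{(d)}$, so gradients with respect to the $d$ free parameters can be backpropagated through the same fast transforms.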

Overall, this paper advances our understanding of neural network training and of the effectiveness of parameter-efficient fine-tuning, and may assist in developing more efficient and effective neural network architectures. However, several intriguing questions persist, including the impact of pre-training on the intrinsic dimension and the possibility of discovering a faster approach to determining the intrinsic dimension. In a future paper pill, we will delve into a follow-up work that introduces a structure-aware intrinsic dimension measure specifically designed for LLMs.

If you’re interested in trying this method or learning more about it, we highly recommend accessing the open-source code provided by the authors as well as the corresponding blog post.