Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Intermediate layers of neural networks have significant representational power that underlies all kinds of cool applications. One can use them to change the style of an image with neural style transfer, to create artificial images that maximize their activations with the DeepDream algorithm, to perform semantic arithmetic with word and sentence embeddings, and much more.

One of the invited talks of ICLR 2022 revolved around recent developments in this fascinating research area. For example, in [Kim18I] this representational power is employed to interpret a neural network’s decision making through high-level, human-designed concepts. The main idea of the paper (dubbed TCAV, for Testing with Concept Activation Vectors) can be formulated like this: find a direction (vector) in the activation space of some intermediate layer $l$ that corresponds to a human-understandable concept, e.g. “striped” or “male”. The directional derivative of the network’s output for a sample $x$ along this direction then measures how sensitive the prediction is to the selected concept. It is found by computing the activations $f_l(x)$ of the $l$-th layer for $x$ and taking the derivative of the remaining network’s output ($h_{l,k}$ in the image below) w.r.t. the concept’s direction at $f_l(x)$.
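Written out, the sensitivity described above takes the following form. Here $h_{l,k}$ denotes the map from layer-$l$ activations to the logit of class $k$, and $v_C^l$ is the symbol used here for the unit vector of concept $C$ in layer $l$ (the notation follows [Kim18I] as far as I recall it, so treat the exact symbols as an assumption):

$$
S_{C,k,l}(x) \;=\; \lim_{\epsilon \to 0} \frac{h_{l,k}\big(f_l(x) + \epsilon\, v_C^l\big) - h_{l,k}\big(f_l(x)\big)}{\epsilon} \;=\; \nabla h_{l,k}\big(f_l(x)\big) \cdot v_C^l .
$$

A positive value means that a small step from $f_l(x)$ in the concept’s direction increases the logit of class $k$; a negative value means that it decreases it.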

The paper proposes a simple and intuitive scheme for finding such directions, which the authors call concept activation vectors (CAVs). A user determines a CAV by providing several inputs that exhibit their concept of choice together with “random” inputs that do not. Note that these inputs don’t have to be part of the training set. First, the activation vectors of an intermediate layer $l$ (the functions $f_l$ in the image below) are computed for all inputs. The desired concept activation vector is then the direction in activation space orthogonal to the hyperplane that best separates the concept activations from the random-input activations. In practice, this direction can easily be computed by fitting a linear binary classifier to the dataset formed by the activation vectors and the corresponding is_concept = True/False labels, and taking the normal vector of its decision boundary. The image below illustrates this procedure for finding a CAV for the concept “striped”.
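The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors’ implementation: the name `fit_cav` is hypothetical, the “activations” are toy Gaussian data standing in for $f_l(x)$, and plain gradient-descent logistic regression is just one possible choice of linear classifier.

```python
import numpy as np

def fit_cav(concept_acts, random_acts, lr=0.1, steps=500):
    """Fit a logistic-regression separator; return the unit normal of its decision boundary."""
    X = np.vstack([concept_acts, random_acts])
    # is_concept labels: 1 for concept examples, 0 for random counterexamples
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(is_concept)
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient step on the log-loss
        b -= lr * np.mean(p - y)
    return w / np.linalg.norm(w)                # points from "random" towards "concept"

# Toy stand-in for layer activations: concept examples are shifted along axis 0.
rng = np.random.default_rng(0)
concept_acts = rng.normal(size=(50, 4)) + np.array([3.0, 0.0, 0.0, 0.0])
random_acts = rng.normal(size=(50, 4))

cav = fit_cav(concept_acts, random_acts)
```

In this toy setup the recovered direction is dominated by the first coordinate, which is the only one that distinguishes the two groups. With a real network, `concept_acts` and `random_acts` would instead be the layer-$l$ activations of the user-provided images.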

Once the CAV is determined, it can be used for a variety of tasks. In particular, the authors propose testing the sensitivity of predicting a class label $k$ w.r.t. the selected concept by computing the fraction of datapoints belonging to this class in the training set for which the directional derivative of the $k$-th logit is positive, i.e. for which a small movement towards the concept increases the logit of class $k$. One can thus check whether the concept “red” is important for predicting a fire truck or whether “male” is important for predicting doctor. This way of testing is dubbed TCAV. Apart from the authors’ implementation, TCAV has also been integrated into Captum, PyTorch’s model interpretation toolbox.
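The score described above can be sketched as follows. This is again only an illustration: `h_k` is a hypothetical toy function standing in for the part of the network between layer $l$ and the logit of class $k$, the function names are made up for this example, and the directional derivative is estimated by central finite differences, whereas a real implementation would use the framework’s automatic differentiation.

```python
import numpy as np

def directional_derivative(h_k, act, cav, eps=1e-4):
    """Central finite-difference estimate of the derivative of h_k at act along cav."""
    return (h_k(act + eps * cav) - h_k(act - eps * cav)) / (2 * eps)

def tcav_score(h_k, class_acts, cav):
    """Fraction of class-k activations whose logit increases along the CAV."""
    sens = np.array([directional_derivative(h_k, a, cav) for a in class_acts])
    return float(np.mean(sens > 0))

# Hypothetical toy head: the class-k logit increases with the first unit
# and decreases with the second; the third unit is ignored.
def h_k(act):
    return 2.0 * np.tanh(act[0]) - 0.5 * act[1]

# Toy stand-in for the layer-l activations of the inputs labelled k.
rng = np.random.default_rng(0)
class_acts = rng.normal(size=(100, 3))

score_up = tcav_score(h_k, class_acts, np.array([1.0, 0.0, 0.0]))
score_down = tcav_score(h_k, class_acts, np.array([0.0, 1.0, 0.0]))
```

For this toy head the score is 1.0 along the first axis and 0.0 along the second, because the derivative has the same sign at every point. For a real network the score typically lies somewhere in between, since the sign of the directional derivative can differ from one input to the next.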

CAVs can be used in a variety of ways; we recommend having a look at the follow-up papers by Been Kim for more details. In my personal opinion, this is a nice extension to the interpretability toolbox, but it should be used with caution, as the CAVs are computed from human inputs and depend heavily on the negative “random” counterexamples. For example, the concept “male” may easily be confused with the concept “human with suit” if the user is not careful enough in their choice of counterexamples.


  • Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV), Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, Rory Sayres. arXiv:1711.11279 [stat] (2018)