For artificial intelligence (AI) transparency, and to better shape upcoming policies, we need to better understand AI outputs. In particular, one may want to understand the role each input plays. This is hard because, in neural networks, an input variable doesn’t have a single weight that could serve as a proxy for its importance with regard to the output. One therefore has to consider all of the neural network’s weights, which may all be interconnected. Here is how Integrated Gradients does this.

Approaches such as LIME, which is covered in a previous post, try to simplify the problem by locally approximating the neural network model. However, the quality of the attributions (the importance of each feature relative to the model output) is hard to assess, because one can’t tell whether incorrect attributions come from problems in the model or from flaws or approximations in the attribution method. Integrated Gradients (IG) seeks to satisfy two desirable axioms for an attribution mechanism:

1. Sensitivity. If changing a single feature changes the classification output, then that feature should have a non-zero attribution. This makes sense: if a feature causes the output to change, it must have played a role. For example, if changing only the feature “Age” changes the predicted decision, then “Age” played a role in it and its attribution should be non-zero.
2. Implementation Invariance. The result of the attribution method should not depend on the specifics of the neural network. If two neural networks are equivalent (i.e. they give the same outputs for the same inputs), their attributions should be the same.

Because computing the gradients of the output with respect to the input is implementation invariant (by the chain rule, $\frac{\partial f}{\partial g} = \frac{\partial f}{ \partial h} \times \frac{\partial h}{ \partial g}$) but does not satisfy Sensitivity (a feature change does not necessarily yield a non-zero gradient for that feature, for instance where the network saturates), gradients can’t be used directly for attributions. To provide explanations, IG makes use of a baseline, a reference input for which the predictions are neutral (e.g. the probabilities are close to $1/k$ for classification with $k$ classes), and then accumulates the gradients along the straight path from the baseline to the input. IG needs a neutral baseline so that it is easy to compare it to the input and so that the model output at the baseline is as close to zero as possible, which is necessary to consider the attributions as depending only on the input.
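To see why plain gradients can violate Sensitivity, here is a minimal sketch in plain Python. It assumes a hypothetical one-variable function $f(x) = 1 - \mathrm{ReLU}(1 - x)$, a baseline of 0 and an input of 2 (all chosen for illustration). The output moves from 0 to 1, yet the gradient at the input is zero because the function has saturated. Averaging the gradient along the path from the baseline to the input gives a non-zero attribution close to the output change.

```python
def relu(z):
    return max(z, 0.0)


def f(x):
    # Hypothetical saturating function: linear up to x = 1, then flat at 1
    return 1.0 - relu(1.0 - x)


def grad_f(x):
    # Hand-written derivative of f: 1 on the linear part, 0 once saturated
    return 1.0 if x < 1.0 else 0.0


baseline, x = 0.0, 2.0

# The output changes between the baseline and the input...
output_change = f(x) - f(baseline)

# ...but the gradient at the input is zero, so a gradient-only attribution is zero
plain_gradient = grad_f(x)

# Average the gradient over m points on the straight path baseline -> x
m = 100
avg_gradient = sum(
    grad_f(baseline + (k / m) * (x - baseline)) for k in range(1, m + 1)
) / m
path_attribution = (x - baseline) * avg_gradient

print(output_change, plain_gradient, path_attribution)
```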

IG is defined as:

$$IntegratedGrads_i(x) \mathrel{\coloncolonequals} (x_i - x^\prime_i ) \times \int_{\alpha=0}^1 \frac{\partial F(x^\prime + \alpha \times (x - x^\prime))}{\partial x_i}d\alpha$$

where:

• $x$ is the input for which we want attributions
• $i$ is a dimension in $x$
• $x^\prime$ is the baseline
• $\alpha$ is the interpolation coefficient, which moves the evaluation point from $x^\prime$ ($\alpha = 0$) to $x$ ($\alpha = 1$)
• $F$ is the neural network
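
As a small worked example (the function is hypothetical, chosen only because its integral is easy to compute by hand), take $F(x_1, x_2) = x_1 x_2$ with the baseline $x^\prime = (0, 0)$. The interpolated point is then $\alpha x$, so $\frac{\partial F}{\partial x_1}$ evaluated there is $\alpha x_2$:

$$IntegratedGrads_1(x) = (x_1 - 0) \times \int_{\alpha=0}^1 \alpha x_2 \, d\alpha = \frac{x_1 x_2}{2}$$

By symmetry, $IntegratedGrads_2(x) = \frac{x_1 x_2}{2}$ as well. Each feature receives half of the output, and in this example the two attributions add up to $F(x) - F(x^\prime) = x_1 x_2$.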

In practice, the integral is computed with a Riemann approximation, which sums rectangular slices of the area under the curve:

$$IntegratedGrads_i(x) \approx (x_i - x^\prime_i ) \times \sum_{k=1}^m \frac{\partial F(x^\prime + \frac{k}{m} \times (x - x^\prime))}{\partial x_i} \times \frac{1}{m}$$

where $m$ is the number of steps in the Riemann approximation. The greater $m$ is, the more accurate the approximation. The paper states that 20 to 300 steps are enough, and the number should grow with the complexity of the network.
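
The effect of $m$ can be checked on a function whose integral is known. The sketch below is plain Python and assumes a hypothetical function $F(x) = x_1^2 x_2$ with a hand-written gradient and a zero baseline (neither comes from the paper). For this choice the exact attributions are $\frac{2}{3} x_1^2 x_2$ and $\frac{1}{3} x_1^2 x_2$, so the error of the Riemann sum can be measured directly.

```python
def grad_F(point):
    # Hand-written gradient of the hypothetical function F(x) = x1**2 * x2
    x1, x2 = point
    return [2.0 * x1 * x2, x1 ** 2]


def integrated_gradients(x, baseline, m):
    # Right Riemann sum over m interpolation steps between baseline and x
    totals = [0.0] * len(x)
    for k in range(1, m + 1):
        point = [b + (k / m) * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_F(point)):
            totals[i] += g / m
    return [(xi - b) * t for xi, b, t in zip(x, baseline, totals)]


x, baseline = [3.0, 2.0], [0.0, 0.0]

# Closed-form attributions for this F and a zero baseline
exact = [2.0 / 3.0 * x[0] ** 2 * x[1], 1.0 / 3.0 * x[0] ** 2 * x[1]]

errors = {}
for m in (20, 300):
    approx = integrated_gradients(x, baseline, m)
    errors[m] = max(abs(a - e) for a, e in zip(approx, exact))
    print(m, approx, errors[m])
```

On this toy function the error shrinks as $m$ grows, at the cost of one gradient evaluation per step.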

### Integrated Gradients in practice

In PyTorch, this is equivalent to:

```python
import torch


# Example deep learning model
class Model(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = torch.nn.Linear(20, 10)
        self.relu = torch.nn.ReLU()
        self.lin2 = torch.nn.Linear(10, 3)

    def forward(self, input):
        return torch.nn.functional.log_softmax(
            self.lin2(self.relu(self.lin1(input))), dim=1
        )


model = Model()

# Generate 50 inputs and baselines with 20 dimensions each
inputs = torch.rand(50, 20)
baseline = torch.zeros_like(inputs)

# Get the classes predicted for the actual inputs and use them as the
# indexes for which we want attributions (kept fixed along the path)
with torch.no_grad():
    idx = model(inputs).argmax(dim=1).unsqueeze(1)

# Number of steps
m = 20

# Hold the gradients for each step
grads = []
for k in range(1, m + 1):
    # Interpolation from the baseline to the input. It is made a leaf
    # tensor so that the gradient is taken at the interpolated point itself
    interpolated = (baseline + (k / m) * (inputs - baseline)).requires_grad_(True)

    # Put the interpolated input through the model
    out = model(interpolated)

    # Select the output for each target class
    out = out.gather(dim=1, index=idx)

    # Backpropagate to get the gradient of the selected outputs with
    # respect to the interpolated input
    (grad,) = torch.autograd.grad(out.sum(), interpolated)

    # Append the gradient for each step
    grads.append(grad)

# Stack the list of gradients, compute the mean over the m steps
grads = torch.stack(grads, 0).mean(dim=0)

# Compute attributions
attr = (inputs - baseline) * grads
```

The captum library (released under the BSD 3-Clause license) provides an easy-to-use implementation of Integrated Gradients.