This program was contributed by Jean-Paul Pelteret and Wolfgang Bangerth.
Wolfgang Bangerth's work is partially supported by National Science Foundation grants OCI-1148116, OAC-1835673, DMS-1821210, and EAR-1925595; and by the Computational Infrastructure in Geodynamics initiative (CIG), through the National Science Foundation under Award No. EAR-1550901 and The University of California-Davis.
Among the issues we had identified there (see the Possibilities for extensions section) was that when wanting to use a Newton iteration, we needed to compute the derivative of the residual of the equation with regard to the solution (here, because the right hand side is zero, the residual is simply the left hand side). For the equation we have here, this is cumbersome but not impossible – but one can easily imagine much more complicated equations where just implementing the residual itself correctly is a challenge, let alone doing so for the derivative necessary to compute the Jacobian matrix. We will address this issue in this program: Using the automatic differentiation techniques discussed in great detail in step-71, we will come up with a way how we only have to implement the residual and get the Jacobian for free.
In fact, we can even go one step further. While in step-15 we have just taken the equation as a given, the minimal surface equation is actually the product of minimizing an energy. Specifically, the minimal surface equations are the Euler-Lagrange equations that correspond to minimizing the energy
where the energy density is given by
This is the same as saying that we seek to find the stationary point of the variation of the energy functional
as this is where the equilibrium solution to the boundary value problem lies.
The key point then is that, maybe, we don't even need to implement the residual, but that implementing the simpler energy density might actually be enough.
Our goal then is this: When using a Newton iteration, we need to repeatedly solve the linear partial differential equation
so that we can compute the update
with the solution of the Newton step. As discussed in step-15, we can compute the derivative by hand and obtain
So here then is what this program is about: It is about techniques that can help us with computing without having to implement it explicitly, either by providing an implementation of or an implementation of . More precisely, we will implement three different approaches and compare them in terms of run-time but also – maybe more importantly – how much human effort it takes to implement them:
The method used in step-15 to form the Jacobian matrix.
Computing the Jacobian matrix from an implementation of the residual , using automatic differentiation.
Computing both the residual and Jacobian matrix from an implementation of the energy functional , also using automatic differentiation.
For the first of these methods, there are no conceptual changes compared to step-15.
Computing the Jacobian from the residual
For the second method, let us outline how we will approach the issue using automatic differentiation to compute the linearization of the residual vector. To this end, let us change notation for a moment and denote by not the residual of the differential equation, but in fact the residual vector – i.e., the discrete residual. We do so because that is what we actually* do when we discretize the problem on a given mesh: We solve the problem where is the vector of unknowns.
More precisely, the th component of the residual is given by
where . Given this, the contribution for cell is
Its first order Taylor expansion is given as
and consequently we can compute the contribution of cell to the Jacobian matrix as . The important point here is that on cell , we can express
For clarity, we have used and as counting indices to make clear that they are distinct from each other and from above. Because in this formula, only depends on the coefficients , we can compute the derivative as a matrix via automatic differentiation of . By the same argument as we always use, it is clear that does not actually depend on all* unknowns , but only on those unknowns for which is a shape function that lives on cell , and so in practice, we restrict and to that part of the vector and matrix that corresponds to the local DoF indices, and then distribute from the local cell to the global objects.
Using all of these realizations, the approach will then be to implement in the program and let the automatic differentiation machinery compute the derivatives from that.
Computing the Jacobian and the residual from the energy functional
For the final implementation of the assembly process, we will move a level higher than the residual: our entire linear system will be determined directly from the energy functional that governs the physics of this boundary value problem. We can take advantage of the fact that we can calculate the total energy in the domain directly from the local contributions, i.e.,
In the discrete setting, this means that on each finite element we have
If we implement the cell energy, which depends on the field solution, we can compute its first (discrete) variation
and, thereafter, its second (discrete) variation
So, from the cell contribution to the total energy function, we may expect to have the approximate residual and tangent contributions generated for us as long as we can provide an implementation of the local energy . Again, due to the design of the automatic differentiation variables used in this tutorial, in practice these approximations for the contributions to the residual vector and tangent matrix are actually accurate to machine precision.
The commented program
The majority of this tutorial is an exact replica of step-15. So, in the interest of brevity and maintaining a focus on the changes implemented here, we will only document what's new and simply indicate which sections of code are a repetition of what has come before.
Include files
There are a few new header files that have been included in this tutorial. The first is the one that provides the declaration of the ParameterAcceptor class.
#include <deal.II/base/quadrature_lib.h>
#include <deal.II/base/function.h>
#include <deal.II/base/parameter_acceptor.h>
#include <deal.II/base/timer.h>
#include <deal.II/base/utilities.h>
This is the second, which is an all-inclusive header that will allow us to incorporate the automatic differentiation (AD) functionality within this code.
In this tutorial we will implement three different approaches for assembling the linear system. One mirrors the hand implementation originally provided in step-15, while the other two use the Sacado automatic differentiation library that is provided as a part of the Trilinos framework.
To facilitate switching between the three implementations, we have this really basic parameters class that has only two options that are configurable.
Selection for the formulation and corresponding AD framework to be used:
formulation = 0 : Unassisted implementation (full hand linearization).
formulation = 1 : Automated linearization of the finite element residual.
formulation = 2 : Automated computation of finite element residual and linearization using a variational formulation.
unsignedint formulation = 0;
The maximum acceptable tolerance for the linear system residual. We will see that the assembly time becomes appreciable once we use the AD framework, so we have increased the tolerance selected in step-15 by one order of magnitude. This way, the computations do not take too long to complete.
The class template is essentially the same as in step-15. The only functional changes to the class are that:
the run() function now takes in two arguments: one to choose which assembly approach is to be adopted, and one for the tolerance for the permissible final residual is, and
there are now three different assembly functions that implement the three methods of assembling the linear system. We'll provide details on these later on.
There have been no changes made to the function that sets up the class data structures, namely the DoFHandler, the hanging node constraints applied to the problem, and the linear system.
The assembly functions are the interesting contributions to this tutorial. The assemble_system_unassisted() method implements exactly the same assembly function as is detailed in step-15, but in this instance we use the MeshWorker::mesh_loop() function to multithread the assembly process. The reason for doing this is quite simple: When using automatic differentiation, we know that there is to be some additional computational overhead incurred. In order to mitigate this performance loss, we'd like to take advantage of as many (easily available) computational resources as possible. The MeshWorker::mesh_loop() concept makes this a relatively straightforward task. At the same time, for the purposes of fair comparison, we need to do the same to the implementation that uses no assistance when computing the residual or its linearization. (The MeshWorker::mesh_loop() function is first discussed in step-12 and step-16, if you'd like to read up on it.)
The steps required to implement the multithreading are the same between the three functions, so we'll use the assemble_system_unassisted() function as an opportunity to focus on the multithreading itself.
The MeshWorker::mesh_loop() expects that we provide two exemplar data structures. The first, ScratchData, is to store all large data that is to be reused between threads. The CopyData will hold the contributions to the linear system that come from each cell. These independent matrix-vector pairs must be accumulated into the global linear system sequentially. Since we don't need anything on top of what the MeshWorker::ScratchData and MeshWorker::CopyData classes already provide, we use these exact class definitions for our problem. Note that we only require a single instance of a local matrix, local right-hand side vector, and cell degree of freedom index vector – the MeshWorker::CopyData therefore has 1 for all three of its template arguments.
We also need to know what type of iterator we'll be working with during assembly. For simplicity, we just ask the compiler to work this out for us using the decltype() specifier, knowing that we'll be iterating over active cells owned by the dof_handler.
using CellIteratorType = decltype(dof_handler.begin_active());
Here we initialize the exemplar data structures. Since we know that we need to compute the shape function gradients, weighted Jacobian, and the position of the quadrate points in real space, we pass these flags into the class constructor.
Now we define a lambda function that will perform the assembly on a single cell. The three arguments are those that will be expected by MeshWorker::mesh_loop(), due to the arguments that we'll pass to that final call. We also capture the this pointer, which means that we'll have access to "this" (i.e., the current MinimalSurfaceProblem<dim>) class instance, and its private member data (since the lambda function is defined within a MinimalSurfaceProblem<dim> method).
At the top of the function, we initialize the data structures that are dependent on the cell for which the work is being performed. Observe that the reinitialization call actually returns an instance to an FEValues object that is initialized and stored within (and, therefore, reused by) the scratch_data object.
Similarly, we get aliases to the local matrix, local RHS vector, and local cell DoF indices from the copy_data instance that MeshWorker::mesh_loop() provides. We then initialize the cell DoF indices, knowing that the local matrix and vector are already correctly sized.
For Newton's method, we require the gradient of the solution at the point about which the problem is being linearized.
Once we have that, we can perform assembly for this cell in the usual way. One minor difference to step-15 is that we've used the (rather convenient) range-based loops to iterate over all quadrature points and degrees-of-freedom.
The second lambda function that MeshWorker::mesh_loop() requires is one that performs the task of accumulating the local contributions in the global linear system. That is precisely what this one does, and the details of the implementation have been seen before. The primary point to recognize is that the local contributions are stored in the copy_data instance that is passed into this function. This copy_data has been filled with data during some call to the cell_worker.
We have all of the required functions definitions in place, so now we call the MeshWorker::mesh_loop() to perform the actual assembly. We pass a flag as the last parameter which states that we only want to perform the assembly on the cells. Internally, MeshWorker::mesh_loop() then distributes the available work to different threads, making efficient use of the multiple cores almost all of today's processors have to offer.
And finally, as is done in step-15, we remove hanging nodes from the system and apply zero boundary values to the linear system that defines the Newton updates .
Assembly via differentiation of the residual vector
As outlined in the introduction, what we need to do for this second approach is implement the local contributions from cell to the residual vector, and then let the AD machinery deal with how to compute the derivatives from it.
We'll define up front the AD data structures that we'll be using, utilizing the techniques shown in step-71. In this case, we choose the helper class that will automatically compute the linearization of the finite element residual using Sacado forward automatic differentiation types. These number types can be used to compute first derivatives only. This is exactly what we want, because we know that we'll only be linearizing the residual, which means that we only need to compute first-order derivatives. The return values from the calculations are to be of type double.
We also need an extractor to retrieve some data related to the field solution to the problem.
We'll now create and initialize an instance of the AD helper class. To do this, we need to specify how many independent variables and dependent variables there are. The independent variables will be the number of local degrees of freedom that our solution vector has, i.e., the number in the per-element representation of the discretized solution vector that indicates how many solution coefficients are associated with each finite element. In deal.II, this equals FiniteElement::dofs_per_cell. The number of dependent variables will be the number of entries in the local residual vector that we will be forming. In this particular problem (like many others that employ the standard Galerkin method) the number of local solution coefficients matches the number of local residual equations.
Next we inform the helper of the values of the solution, i.e., the actual values for about which we wish to linearize. As this is done on each element individually, we have to extract the solution coefficients from the global solution vector. In other words, we define all of those coefficients where is a local degree of freedom as the independent variables that enter the computation of the vector (the dependent function).
Then we get the complete set of degree of freedom values as represented by auto-differentiable numbers. The operations performed with these variables are tracked by the AD library from this point until the object goes out of scope. So it is precisely these variables with respect to which we will compute derivatives of the residual entries.
Then we do some problem specific tasks, the first being to compute all values, (spatial) gradients, and the like based on "sensitive" AD degree of freedom values. In this instance we are retrieving the solution gradients at each quadrature point. Observe that the solution gradients are now sensitive to the values of the degrees of freedom as they use the ADNumberType as the scalar type and the dof_values_ad vector provides the local DoF values.
The next variable that we declare will store the cell residual vector contributions. This is rather self-explanatory, save for one very important detail: Note that each entry in the vector is hand-initialized with a value of zero. This is a highly recommended practice, as some AD libraries appear not to safely initialize the internal data structures of these number types. Not doing so could lead to some very hard to understand or detect bugs (appreciate that the author of this program mentions this out of, generally bad, experience). So out of an abundance of caution it's worthwhile zeroing the initial value explicitly. After that, apart from a sign change the residual assembly looks much the same as we saw for the cell RHS vector before: We loop over all quadrature points, ensure that the coefficient now encodes its dependence on the (sensitive) finite element DoF values by using the correct ADNumberType, and finally we assemble the components of the residual vector. For complete clarity, the finite element shape functions (and their gradients, etc.) as well as the "JxW" values remain scalar valued, but the coeff and the old_solution_gradients at each quadrature point are computed in terms of the independent variables.
Once we have the full cell residual vector computed, we can register it with the helper class.
Thereafter, we compute the residual values (basically, extracting the real values from what we already computed) and their Jacobian (the linearization of each residual component with respect to all cell DoFs) at the evaluation point. For the purposes of assembly into the global linear system, we have to respect the sign difference between the residual and the RHS contribution: For Newton's method, the right hand side vector needs to be equal to the negative residual vector.
ad_helper.register_residual_vector(residual_ad);
ad_helper.compute_residual(cell_rhs);
cell_rhs *= -1.0;
ad_helper.compute_linearization(cell_matrix);
};
The remainder of the function equals what we had previously:
In this implementation of the assembly process, we choose the helper class that will automatically compute both the residual and its linearization from the cell contribution to an energy functional using nested Sacado forward automatic differentiation types. The selected number types can be used to compute both first and second derivatives. We require this, as the residual defined as the sensitivity of the potential energy with respect to the DoF values (i.e. its gradient). We'll then need to linearize the residual, implying that second derivatives of the potential energy must be computed. You might want to compare this with the definition of ADHelper used int previous function, where we used Differentiation::AD::ResidualLinearization<Differentiation::AD::NumberTypes::sacado_dfad,double>.
Let us then again define the lambda function that does the integration on a cell.
To initialize an instance of the helper class, we now only require that the number of independent variables (that is, the number of degrees of freedom associated with the element solution vector) are known up front. This is because the second-derivative matrix that results from an energy functional is necessarily square (and also, incidentally, symmetric).
Once more, we register all cell DoFs values with the helper, followed by extracting the "sensitive" variant of these values that are to be used in subsequent operations that must be differentiated – one of those being the calculation of the solution gradients.
We next create a variable that stores the cell total energy. Once more we emphasize that we explicitly zero-initialize this value, thereby ensuring the integrity of the data for this starting value.
The aim for our approach is then to compute the cell total energy, which is the sum of the internal (due to right hand side functions, typically linear in ) and external energies. In this particular case, we have no external energies (e.g., from source terms or Neumann boundary conditions), so we'll focus on the internal energy part.
In fact, computing is almost trivial, requiring only the following lines:
ADNumberType energy_ad = ADNumberType(0.0);
for (constunsignedint q : fe_values.quadrature_point_indices())
After we've computed the total energy on this cell, we'll register it with the helper. Based on that, we may now compute the desired quantities, namely the residual values and their Jacobian at the evaluation point. As before, the Newton right hand side needs to be the negative of the residual:
ad_helper.register_energy_functional(energy_ad);
ad_helper.compute_residual(cell_rhs);
cell_rhs *= -1.0;
ad_helper.compute_linearization(cell_matrix);
};
As in the previous two functions, the remainder of the function is as before:
... as does the function used to compute the residual during the solution iteration procedure. One could replace this by differentiation of the energy functional if one really wanted, but for simplicity we here simply copy what we already had in step-15.
The choice of step length (or, under-relaxation factor) for the nonlinear iterations procedure remains fixed at the value chosen and discussed in step-15.
This last function to be called from run() outputs the current solution (and the Newton update) in graphical form as a VTU file. It is entirely the same as what has been used in previous tutorials.
In the run function, most remains the same as was first implemented in step-15. The only observable changes are that we can now choose (via the parameter file) what the final acceptable tolerance for the system residual is, and that we can choose which method of assembly we wish to utilize. To make the second choice clear, we output to the console some message which indicates the selection. Since we're interested in comparing the time taken to assemble for each of the three methods, we've also added a timer that keeps a track of how much time is spent during assembly. We also track the time taken to solve the linear system, so that we can contrast those numbers to the part of the code which would normally take the longest time to execute.
void hyper_ball(Triangulation< dim > &tria, const Point< dim > ¢er=Point< dim >(), const double radius=1., const bool attach_spherical_manifold_on_boundary_cells=false)
The main function
Finally the main function. This follows the scheme of most other main functions, with two obvious exceptions:
We call Utilities::MPI::MPI_InitFinalize in order to set up (via a hidden default parameter) the number of threads using the execution of multithreaded tasks.
We also have a few lines dedicates to reading in or initializing the user-defined parameters that will be considered during the execution of the program.
Since there was no change to the physics of the problem that has first been analyzed in step-15, there is nothing to report about that. The only outwardly noticeable difference between them is that, by default, this program will only run 9 mesh refinement steps (as opposed to step-15, which executes 11 refinements). This will be observable in the simulation status that appears between the header text that prints which assembly method is being used, and the final timings. (All timings reported below were obtained in release mode.)
Mesh refinement step 0
Initial residual: 1.53143
Residual: 1.08746
Residual: 0.966748
Residual: 0.859602
Residual: 0.766462
Residual: 0.685475
...
Mesh refinement step 9
Initial residual: 0.00924594
Residual: 0.00831928
Residual: 0.0074859
Residual: 0.0067363
Residual: 0.00606197
Residual: 0.00545529
So what is interesting for us to compare is how long the assembly process takes for the three different implementations, and to put that into some greater context. Below is the output for the hand linearization (as computed on a circa 2012 four core, eight thread laptop – but we're only really interested in the relative time between the different implementations):
Assembly approach ********
Unassisted implementation (full hand linearization).
And, lastly, for the implementation that computes both the residual and its linearization directly from an energy functional (using nested Sacado dynamic forward AD numbers):
Assembly approach ********
Automated computation of finite element residual and linearization using a variational formulation.
It's evident that the more work that is passed off to the automatic differentiation framework to perform, the more time is spent during the assembly process. Accumulated over all refinement steps, using one level of automatic differentiation resulted in more computational time being spent in the assembly stage when compared to unassisted assembly, while assembling the discrete linear system took longer when deriving directly from the energy functional. Unsurprisingly, the overall time spent solving the linear system remained unchanged. This means that the proportion of time spent in the solve phase to the assembly phase shifted significantly as the number of times automated differentiation was performed at the finite element level. For many, this might mean that leveraging higher-order differentiation (at the finite element level) in production code leads to an unacceptable overhead, but it may still be useful during the prototyping phase. A good compromise between the two may, therefore, be the automated linearization of the finite element residual, which offers a lot of convenience at a measurable, but perhaps not unacceptable, cost. Alternatively, one could consider not re-building the Newton matrix in every step – a topic that is explored in substantial depth in step-77.
Of course, in practice the actual overhead is very much dependent on the problem being evaluated (e.g., how many components there are in the solution, what the nature of the function being differentiated is, etc.). So the exact results presented here should be interpreted within the context of this scalar problem alone, and when it comes to other problems, some preliminary investigation by the user is certainly warranted.
Possibilities for extensions
Like step-71, there are a few items related to automatic differentiation that could be evaluated further:
The use of other AD frameworks should be investigated, with the outlook that alternative implementations may provide performance benefits.
It is also worth evaluating AD number types other than those that have been hard-coded into this tutorial. With regard to twice differentiable types employed at the finite-element level, mixed differentiation modes ("RAD-FAD") should in principle be more computationally efficient than the single mode ("FAD-FAD") types employed here. The reason that the RAD-FAD type was not selected by default is that, at the time of writing, there remain some bugs in its implementation within the Sacado library that lead to memory leaks. This is documented in the Automatic and symbolic differentiation module.
It might be the case that using reduced precision types (i.e., float) as the scalar types for the AD numbers could render a reduction in computational expense during assembly. Using float as the data type for the matrix and the residual is not unreasonable, given that the Newton update is only meant to get us closer to the solution, but not actually to the solution; as a consequence, it makes sense to consider using reduced-precision data types for computing these updates, and then accumulating these updates in a solution vector that uses the full double precision accuracy.
One further method of possibly reducing resources during assembly is to frame the AD implementations as a constitutive model. This would be similar to the approach adopted in step-71, and pushes the starting point for the automatic differentiation one level higher up the chain of computations. This, in turn, means that less operations are tracked by the AD library, thereby reducing the cost of differentiating (even though one would perform the differentiation at each cell quadrature point).
step-77 is yet another variation of step-15 that addresses a very different part of the problem: Line search and whether it is necessary to re-build the Newton matrix in every nonlinear iteration. Given that the results above show that using automatic differentiation comes at a cost, the techniques in step-77 have the potential to offset these costs to some degree. It would therefore be quite interesting to combine the current program with the techniques in step-77. For production codes, this would certainly be the way to go.