![]() |
Reference documentation for deal.II version 9.5.1
|
This tutorial depends on step-37.
Table of contents | |
---|---|
This program was contributed by Katharina Kormann and Martin Kronbichler.
This work was partly supported by the German Research Foundation (DFG) through the project "High-order discontinuous Galerkin for the exa-scale" (ExaDG) within the priority program "Software for Exascale Computing" (SPPEXA).
Matrix-free operator evaluation enables very efficient implementations of discretization with high-order polynomial bases due to a method called sum factorization. This concept has been introduced in the step-37 and step-48 tutorial programs. In this tutorial program, we extend those concepts to discontinuous Galerkin (DG) schemes that include face integrals, a class of methods where high orders are particularly widespread.
The underlying idea of the matrix-free evaluation is the same as for continuous elements: The matrix-vector product that appears in an iterative solver or multigrid smoother is not implemented by a classical sparse matrix kernel, but instead applied implicitly by the evaluation of the underlying integrals on the fly. For tensor product shape functions that are integrated with a tensor product quadrature rule, this evaluation is particularly efficient by using the sum-factorization technique, which decomposes the initially
More information on the algorithms are available in the preprint
Fast matrix-free evaluation of discontinuous Galerkin finite element operators by Martin Kronbichler and Katharina Kormann, arXiv:1711.03590.
For this tutorial program, we exemplify the matrix-free DG framework for the interior penalty discretization of the Laplacian, i.e., the same scheme as the one used for the step-39 tutorial program. The discretization of the Laplacian is given by the following weak form
where
The terms in the equation represent the cell integral after integration by parts, the primal consistency term that arises at the element interfaces due to integration by parts and insertion of an average flux, the adjoint consistency term that is added for restoring symmetry of the underlying matrix, and a penalty term with factor
In the implementation below, we implement the weak form above by moving the normal vector
For boundary conditions, we use the so-called mirror principle that defines artificial exterior values
The matrix-free framework of deal.II provides the necessary infrastructure to implement the action of the discretized equation above. As opposed to the MatrixFree::cell_loop() that we used in step-37 and step-48, we now build a code in terms of MatrixFree::loop() that takes three function pointers, one for the cell integrals, one for the inner face integrals, and one for the boundary face integrals (in analogy to the design of MeshWorker used in the step-39 tutorial program). In each of these three functions, we then implement the respective terms on the quadrature points. For interpolation between the vector entries and the values and gradients on quadrature points, we use the class FEEvaluation for cell contributions and FEFaceEvaluation for face contributions. The basic usage of these functions has been discussed extensively in the step-37 tutorial program.
In MatrixFree::loop(), all interior faces are visited exactly once, so one must make sure to compute the contributions from both the test functions phi_inner
and phi_outer
for testing with the normal derivative of the test function, and values with opposite sign for testing with the values of the test function, because the latter involves opposite signs due to the jump term. For faces between cells of different refinement level, the integration is done from the refined side, and FEFaceEvaluation automatically performs interpolation to a subface on the coarse side. Thus, a hanging node never appears explicitly in a user implementation of a weak form.
The fact that each face is visited exactly once also applies to those faces at subdomain boundaries between different processors when parallelized with MPI, where one cell belongs to one processor and one to the other. The setup in MatrixFree::reinit() splits the faces between the two sides, and eventually only reports the faces actually handled locally in MatrixFree::n_inner_face_batches() and MatrixFree::n_boundary_face_batches(), respectively. Note that, in analogy to the cell integrals discussed in step-37, deal.II applies vectorization over several faces to use SIMD, working on something we call a batch of faces with a single instruction. The face batches are independent from the cell batches, even though the time at which face integrals are processed is kept close to the time when the cell integrals of the respective cells are processed, in order to increase the data locality.
Another thing that is new in this program is the fact that we no longer split the vector access like FEEvaluation::read_dof_values() or FEEvaluation::distribute_local_to_global() from the evaluation and integration steps, but call combined functions FEEvaluation::gather_evaluate() and FEEvaluation::integrate_scatter(), respectively. This is useful for face integrals because, depending on what gets evaluated on the faces, not all vector entries of a cell must be touched in the first place. Think for example of the case of the nodal element FE_DGQ with node points on the element surface: If we are interested in the shape function values on a face, only
Now of course we are not interested in only the function values, but also the derivatives on the cell. Fortunately, there is an element in deal.II that extends this property of reduced access also for derivatives on faces, the FE_DGQHermite element.
The element FE_DGQHermite belongs to the family of FE_DGQ elements, i.e., its shape functions are a tensor product of 1D polynomials and the element is fully discontinuous. As opposed to the nodal character in the usual FE_DGQ element, the FE_DGQHermite element is a mixture of nodal contributions and derivative contributions based on a Hermite-like concept. The underlying polynomial class is Polynomials::HermiteLikeInterpolation and can be summarized as follows: For cubic polynomials, we use two polynomials to represent the function value and first derivative at the left end of the unit interval,
Using this element, we only need to access read_dof_values()
.
This optimization is not only useful for computing the face integrals, but also for the MPI ghost layer exchange: In a naive exchange, we would need to send all degrees of freedom of a cell to another processor if the other processor is responsible for computing the face's contribution. Since we know that only some of the degrees of freedom in the evaluation with FEFaceEvaluation are touched, it is natural to only exchange the relevant ones. The MatrixFree::loop() function has support for a selected data exchange when combined with LinearAlgebra::distributed::Vector. To make this happen, we need to tell the loop what kind of evaluation on faces we are going to do, using an argument of type MatrixFree::DataAccessOnFaces, as can be seen in the implementation of LaplaceOperator::vmult()
below. The way data is exchanged in that case is as follows: The ghost layer data in the vector still pretends to represent all degrees of freedom, such that FEFaceEvaluation can continue to read the values as if the cell were a locally owned one. The data exchange routines take care of the task for packing and unpacking the data into this format. While this sounds pretty complicated, we will show in the results section below that this really pays off by comparing the performance to a baseline code that does not specify the data access on faces.
In the tradition of the step-37 program, we again solve a Poisson problem with a geometric multigrid preconditioner inside a conjugate gradient solver. Instead of computing the diagonal and use the basic PreconditionChebyshev as a smoother, we choose a different strategy in this tutorial program. We implement a block-Jacobi preconditioner, where a block refers to all degrees of freedom on a cell. Rather than building the full cell matrix and applying its LU factorization (or inverse) in the preconditioner — an operation that would be heavily memory bandwidth bound and thus pretty slow — we approximate the inverse of the block by a special technique called fast diagonalization method.
The idea of the method is to take use of the structure of the cell matrix. In case of the Laplacian with constant coefficients discretized on a Cartesian mesh, the cell matrix
in 2D and
in 3D. The matrices
Interestingly, the exact inverse of the matrix
where
and
The deal.II library implements a class using this concept, called TensorProductMatrixSymmetricSum.
For the sake of this program, we stick with constant coefficients and Cartesian meshes, even though an approximate version based on tensor products would still be possible for a more general mesh, and the operator evaluation itself is of course generic. Also, we do not bother with adaptive meshes where the multigrid algorithm would need to get access to flux matrices over the edges of different refinement, as explained in step-39. One thing we do, however, is to still wrap our block-Jacobi preconditioner inside PreconditionChebyshev. That class relieves us from finding an appropriate relaxation parameter (which would be around 0.7 in 2D and 0.5 in 3D for the block-Jacobi smoother), and often increases smoothing efficiency somewhat over plain Jacobi smoothing, especially when using several iterations.
Note that the block-Jacobi smoother has an additional benefit: The fast diagonalization method can also be interpreted as a change from the Hermite-like polynomials underlying FE_DGQHermite to a basis where the cell Laplacian is diagonal. Thus, it cancels the effect of the basis, and we get the same iteration counts irrespective of whether we use FE_DGQHermite or FE_DGQ. This is in contrast to using the PreconditionChebyshev class with only the diagonal (a point-Jacobi scheme), where FE_DGQ and FE_DGQHermite do indeed behave differently and FE_DGQ needs fewer iterations than FE_DGQHermite, despite the modification made to the Hermite-like shape functions to ensure a good conditioning.
The include files are essentially the same as in step-37, with the exception of the finite element class FE_DGQHermite instead of FE_Q. All functionality for matrix-free computations on face integrals is already contained in fe_evaluation.h
.
As in step-37, we collect the dimension and polynomial degree as constants here at the top of the program for simplicity. As opposed to step-37, we choose a really high order method this time with degree 8 where any implementation not using sum factorization would become prohibitively slow compared to the implementation with MatrixFree which provides an efficiency that is essentially the same as at degrees two or three. Furthermore, all classes in this tutorial program are templated, so it would be easy to select the degree at run time from an input file or a command-line argument by adding instantiations of the appropriate degrees in the main()
function.
In analogy to step-7, we define an analytic solution that we try to reproduce with our discretization. Since the aim of this tutorial is to show matrix-free methods, we choose one of the simplest possibilities, namely a cosine function whose derivatives are simple enough for us to compute analytically. Further down, the wave number 2.4 we select here will be matched with the domain extent in
The LaplaceOperator
class is similar to the respective class in step-37. A significant difference is that we do not derive the class from MatrixFreeOperators::Base because we want to present some additional features of MatrixFree::loop() that are not available in the general-purpose class MatrixFreeOperators::Base. We derive the class from the Subscriptor class to be able to use the operator within the Chebyshev preconditioner because that preconditioner stores the underlying matrix via a SmartPointer.
Given that we implement a complete matrix interface by hand, we need to add an initialize()
function, an m()
function, a vmult()
function, and a Tvmult()
function that were previously provided by MatrixFreeOperators::Base. Our LaplaceOperator also contains a member function get_penalty_factor()
that centralizes the selection of the penalty parameter in the symmetric interior penalty method according to step-39.
The PreconditionBlockJacobi
class defines our custom preconditioner for this problem. As opposed to step-37 which was based on the matrix diagonal, we here compute an approximate inversion of the diagonal blocks in the discontinuous Galerkin method by using the so-called fast diagonalization method discussed in the introduction.
This free-standing function is used in both the LaplaceOperator
and PreconditionBlockJacobi
classes to adjust the ghost range. This function is necessary because some of the vectors that the vmult()
functions are supplied with are not initialized properly with LaplaceOperator::initialize_dof_vector
that includes the correct layout of ghost entries, but instead comes from the MGTransferMatrixFree class that has no notion on the ghost selection of the matrix-free classes. To avoid index confusion, we must adjust the ghost range before actually doing something with these vectors. Since the vectors are kept around in the multigrid smoother and transfer classes, a vector whose ghost range has once been adjusted will remain in this state throughout the lifetime of the object, so we can use a shortcut at the start of the function to see whether the partitioner object of the distributed vector, which is stored as a shared pointer, is the same as the layout expected by MatrixFree, which is stored in a data structure accessed by MatrixFree::get_dof_info(0), where the 0 indicates the DoFHandler number from which this was extracted; we only use a single DoFHandler in MatrixFree, so the only valid number is 0 here.
The next five functions to clear and initialize the LaplaceOperator
class, to return the shared pointer holding the MatrixFree data container, as well as the correct initialization of the vector and operator sizes are the same as in step-37 or rather MatrixFreeOperators::Base.
This function implements the action of the LaplaceOperator on a vector src
and stores the result in the vector dst
. When compared to step-37, there are four new features present in this call.
The first new feature is the adjust_ghost_range_if_necessary
function mentioned above that is needed to fit the vectors to the layout expected by FEEvaluation and FEFaceEvaluation in the cell and face functions.
The second new feature is the fact that we do not implement a vmult_add()
function as we did in step-37 (through the virtual function MatrixFreeOperators::Base::vmult_add()), but directly implement a vmult()
functionality. Since both cell and face integrals will sum into the destination vector, we must of course zero the vector somewhere. For DG elements, we are given two options – one is to use FEEvaluation::set_dof_values() instead of FEEvaluation::distribute_local_to_global() in the apply_cell
function below. This works because the loop layout in MatrixFree is such that cell integrals always touch a given vector entry before the face integrals. However, this really only works for fully discontinuous bases where every cell has its own degrees of freedom, without any sharing with neighboring results. An alternative setup, the one chosen here, is to let the MatrixFree::loop() take care of zeroing the vector. This can be thought of as simply calling dst = 0;
somewhere in the code. The implementation is more involved for supported vectors such as LinearAlgebra::distributed::Vector
, because we aim to not zero the whole vector at once. Doing the zero operation on a small enough pieces of a few thousands of vector entries has the advantage that the vector entries that get zeroed remain in caches before they are accessed again in FEEvaluation::distribute_local_to_global() and FEFaceEvaluation::distribute_local_to_global(). Since matrix-free operator evaluation is really fast, just zeroing a large vector can amount to up to a 25% of the operator evaluation time, and we obviously want to avoid this cost. This option of zeroing the vector is also available for MatrixFree::cell_loop and for continuous bases, even though it was not used in the step-37 or step-48 tutorial programs.
The third new feature is the way we provide the functions to compute on cells, inner faces, and boundary faces: The class MatrixFree has a function called loop
that takes three function pointers to the three cases, allowing to separate the implementations of different things. As explained in step-37, these function pointers can be std::function
objects or member functions of a class. In this case, we use pointers to member functions.
The final new feature are the last two arguments of type MatrixFree::DataAccessOnFaces that can be given to MatrixFree::loop(). This class passes the type of data access for face integrals to the MPI data exchange routines LinearAlgebra::distributed::Vector::update_ghost_values() and LinearAlgebra::distributed::Vector::compress() of the parallel vectors. The purpose is to not send all degrees of freedom of a neighboring element, but to reduce the amount of data to what is really needed for the computations at hand. The data exchange is a real bottleneck in particular for high-degree DG methods, therefore a more restrictive way of exchange is often beneficial. The enum field MatrixFree::DataAccessOnFaces can take the value none
, which means that no face integrals at all are done, which would be analogous to MatrixFree::cell_loop(), the value values
meaning that only shape function values (but no derivatives) are used on faces, and the value gradients
when also first derivatives on faces are accessed besides the values. A value unspecified
means that all degrees of freedom will be exchanged for the faces that are located at the processor boundaries and designated to be worked on at the local processor.
To see how the data can be reduced, think of the case of the nodal element FE_DGQ with node points on the element surface, where only
Since the Laplacian is symmetric, the Tvmult()
(needed by the multigrid smoother interfaces) operation is simply forwarded to the vmult()
case.
The cell operation is very similar to step-37. We do not use a coefficient here, though. The second difference is that we replaced the two steps of FEEvaluation::read_dof_values() followed by FEEvaluation::evaluate() by a single function call FEEvaluation::gather_evaluate() which internally calls the sequence of the two individual methods. Likewise, FEEvaluation::integrate_scatter() implements the sequence of FEEvaluation::integrate() followed by FEEvaluation::distribute_local_to_global(). In this case, these new functions merely save two lines of code. However, we use them for the analogy with FEFaceEvaluation where they are more important as explained below.
The face operation implements the terms of the interior penalty method in analogy to step-39, as explained in the introduction. We need two evaluator objects for this task, one for handling the solution that comes from the cell on one of the two sides of an interior face, and one for handling the solution from the other side. The evaluators for face integrals are called FEFaceEvaluation and take a boolean argument in the second slot of the constructor to indicate which of the two sides the evaluator should belong two. In FEFaceEvaluation and MatrixFree, we call one of the two sides the interior
one and the other the exterior
one. The name exterior
refers to the fact that the evaluator from both sides will return the same normal vector. For the interior
side, the normal vector points outwards, whereas it points inwards on the other side, and is opposed to the outer normal vector of that cell. Apart from the new class name, we again get a range of items to work with in analogy to what was discussed in step-37, but for the interior faces in this case. Note that the data structure of MatrixFree forms batches of faces that are analogous to the batches of cells for the cell integrals. All faces within a batch involve different cell numbers but have the face number within the reference cell, have the same refinement configuration (no refinement or the same subface), and the same orientation, to keep SIMD operations simple and efficient.
Note that there is no implied meaning in interior versus exterior except the logic decision of the orientation of the normal, which is pretty random internally. One can in no way rely on a certain pattern of assigning interior versus exterior flags, as the decision is made for the sake of access regularity and uniformity in the MatrixFree setup routines. Since most sane DG methods are conservative, i.e., fluxes look the same from both sides of an interface, the mathematics are unaltered if the interior/exterior flags are switched and normal vectors get the opposite sign.
On a given batch of faces, we first update the pointers to the current face and then access the vector. As mentioned above, we combine the vector access with the evaluation. In the case of face integrals, the data access into the vector can be reduced for the special case of an FE_DGQHermite basis as explained for the data exchange above: Since only
The next two statements compute the penalty parameter for the interior penalty method. As explained in the introduction, we would like to have a scaling like dim
components, we must finally pick the component that is oriented normal to the reference cell. In the geometry data stored in MatrixFree, a permutation of the components in the Jacobian is applied such that this latter direction is always the last component dim-1
(this is beneficial because reference-cell derivative sorting can be made agnostic of the direction of the face). This means that we can simply access the last entry dim-1
and must not look up the local face number in data.get_face_info(face).interior_face_no
and data.get_face_info(face).exterior_face_no
. Finally, we must also take the absolute value of these factors as the normal could point into either positive or negative direction.
In the loop over the quadrature points, we eventually compute all contributions to the interior penalty scheme. According to the formulas in the introduction, the value of the test function gets multiplied by the difference of the jump in the solution times the penalty parameter and the average of the normal derivative in real space. Since the two evaluators for interior and exterior sides get different signs due to the jump, we pass the result with a different sign here. The normal derivative of the test function gets multiplied by the negative jump in the solution between the interior and exterior side. This term, coined adjoint consistency term, must also include the factor of
Once we are done with the loop over quadrature points, we can do the sum factorization operations for the integration loops on faces and sum the results into the result vector, using the integrate_scatter
function. The name scatter
reflects the distribution of the vector data into scattered positions in the vector using the same pattern as in gather_evaluate
. Like before, the combined integrate + write operation allows us to reduce the data access.
The boundary face function follows by and large the interior face function. The only difference is the fact that we do not have a separate FEFaceEvaluation object that provides us with exterior values LaplaceProblem::compute_rhs()
. Note that due to extension of the solution
There is one catch at this point: The implementation below uses a boolean variable is_dirichlet
to switch between the Dirichlet and the Neumann cases. However, we solve a problem where we also want to impose periodic boundary conditions on some boundaries, namely along those in the apply_face()
function and not in this one.
Next we turn to the preconditioner initialization. As explained in the introduction, we want to construct an (approximate) inverse of the cell matrices from a product of 1d mass and Laplace matrices. Our first task is to compute the 1d matrices, which we do by first creating a 1d finite element. Instead of anticipating FE_DGQHermite<1> here, we get the finite element's name from DoFHandler, replace the dim
argument (2 or 3) by 1 to create a 1d name, and construct the 1d element by using FETools.
As for computing the 1d matrices on the unit element, we simply write down what a typical assembly procedure over rows and columns of the matrix as well as the quadrature points would do. We select the same Laplace matrices once and for all using the coefficients 0.5 for interior faces (but possibly scaled differently in different directions as a result of the mesh). Thus, we make a slight mistake at the Dirichlet boundary (where the correct factor would be 1 for the derivative terms and 2 for the penalty term, see step-39) or at the Neumann boundary where the factor should be zero. Since we only use this class as a smoother inside a multigrid scheme, this error is not going to have any significant effect and merely affects smoothing quality.
The left and right boundary terms assembled by the next two statements appear to have somewhat arbitrary signs, but those are correct as can be verified by looking at step-39 and inserting the value -1 and 1 for the normal vector in the 1d case.
Next, we go through the cells and pass the scaled matrices to TensorProductMatrixSymmetricSum to actually compute the generalized eigenvalue problem for representing the inverse: Since the matrix approximation is constructed as reinit()
has been called.
Once we have accessed the inverse Jacobian through the FEEvaluation access function (we take the one for the zeroth quadrature point as they should be the same on all quadrature points for a Cartesian cell), we check that it is diagonal and then extract the determinant of the original Jacobian, i.e., the inverse of the determinant of the inverse Jacobian, and set the weight as
Once we know the factor by which we should scale the Laplace matrix, we apply this weight to the unscaled DG Laplace matrix and send the array to the class TensorProductMatrixSymmetricSum for computing the generalized eigenvalue problem mentioned in the introduction.
The vmult function for the approximate block-Jacobi preconditioner is very simple in the DG context: We simply need to read the values of the current cell batch, apply the inverse for the given entry in the array of tensor product matrix, and write the result back. In this loop, we overwrite the content in dst
rather than first setting the entries to zero. This is legitimate for a DG method because every cell has independent degrees of freedom. Furthermore, we manually write out the loop over all cell batches, rather than going through MatrixFree::cell_loop(). We do this because we know that we are not going to need data exchange over the MPI network here as all computations are done on the cells held locally on each processor.
The definition of the LaplaceProblem class is very similar to step-37. One difference is the fact that we add the element degree as a template argument to the class, which would allow us to more easily include more than one degree in the same program by creating different instances in the main()
function. The second difference is the selection of the element, FE_DGQHermite, which is specialized for this kind of equations.
The setup function differs in two aspects from step-37. The first is that we do not need to interpolate any constraints for the discontinuous ansatz space, and simply pass a dummy AffineConstraints object into Matrixfree::reinit(). The second change arises because we need to tell MatrixFree to also initialize the data structures for faces. We do this by setting update flags for the inner and boundary faces, respectively. On the boundary faces, we need both the function values, their gradients, JxW values (for integration), the normal vectors, and quadrature points (for the evaluation of the boundary conditions), whereas we only need shape function values, gradients, JxW values, and normal vectors for interior faces. The face data structures in MatrixFree are always built as soon as one of mapping_update_flags_inner_faces
or mapping_update_flags_boundary_faces
are different from the default value update_default
of UpdateFlags.
The computation of the right hand side is a bit more complicated than in step-37. The cell term now consists of the negative Laplacian of the analytical solution, RightHandSide
, for which we need to first split up the Point of VectorizedArray fields, i.e., a batch of points, into a single point by evaluating all lanes in the VectorizedArray separately. Remember that the number of lanes depends on the hardware; it could be 1 for systems that do not offer vectorization (or where deal.II does not have intrinsics), but it could also be 8 or 16 on AVX-512 of recent Intel architectures.
Secondly, we also need to apply the Dirichlet and Neumann boundary conditions. This function is the missing part of to the function LaplaceOperator::apply_boundary()
function once the exterior solution values
We could have issued both the cell and the boundary part through a MatrixFree::loop part, but we choose to manually write the full loop over all faces to learn how the index layout of face indices is set up in MatrixFree: Both the inner faces and the boundary faces share the index range, and all batches of inner faces have lower numbers than the batches of boundary cells. A single index for both variants allows us to easily use the same data structure FEFaceEvaluation for both cases that attaches to the same data field, just at different positions. The number of inner face batches (where a batch is due to the combination of several faces into one for vectorization) is given by MatrixFree::n_inner_face_batches(), whereas the number of boundary face batches is given by MatrixFree::n_boundary_face_batches().
The MatrixFree class lets us query the boundary_id of the current face batch. Remember that MatrixFree sets up the batches for vectorization such that all faces within a batch have the same properties, which includes their boundary_id
. Thus, we can query that id here for the current face index face
and either impose the Dirichlet case (where we add something to the function value) or the Neumann case (where we add something to the normal derivative).
Since we have manually run the loop over cells rather than using MatrixFree::loop(), we must not forget to perform the data exchange with MPI - or actually, we would not need that for DG elements here because each cell carries its own degrees of freedom and cell and boundary integrals only evaluate quantities on the locally owned cells. The coupling to neighboring subdomain only comes in by the inner face integrals, which we have not done here. That said, it does not hurt to call this function here, so we do it as a reminder of what happens inside MatrixFree::loop().
The solve()
function is copied almost verbatim from step-37. We set up the same multigrid ingredients, namely the level transfer, a smoother, and a coarse grid solver. The only difference is the fact that we do not use the diagonal of the Laplacian for the preconditioner of the Chebyshev iteration used for smoothing, but instead our newly resolved class PreconditionBlockJacobi
. The mechanisms are the same, though.
Since we have solved a problem with analytic solution, we want to verify the correctness of our implementation by computing the L2 error of the numerical result against the analytic solution.
The run()
function sets up the initial grid and then runs the multigrid program in the usual way. As a domain, we choose a rectangle with periodic boundary conditions in the Solution
) as compared to the
There is nothing unexpected in the main()
function. We call MPI_Init()
through the MPI_InitFinalize
class, pass on the two parameters on the dimension and the degree set at the top of the file, and run the Laplace problem.
Like in step-37, we evaluate the multigrid solver in terms of run time. In two space dimensions with elements of degree 8, a possible output could look as follows:
Like in step-37, the number of CG iterations remains constant with increasing problem size. The iteration counts are a bit higher, which is because we use a lower degree of the Chebyshev polynomial (2 vs 5 in step-37) and because the interior penalty discretization has a somewhat larger spread in eigenvalues. Nonetheless, 13 iterations to reduce the residual by 12 orders of magnitude, or almost a factor of 9 per iteration, indicates an overall very efficient method. In particular, we can solve a system with 21 million degrees of freedom in 5 seconds when using 12 cores, which is a very good efficiency. Of course, in 2D we are well inside the regime of roundoff for a polynomial degree of 8 – as a matter of fact, around 83k DoFs or 0.025s would have been enough to fully converge this (simple) analytic solution here.
Not much changes if we run the program in three spatial dimensions, except for the fact that we now use do something more useful with the higher polynomial degree and increasing mesh sizes, as the roundoff errors are only obtained at the finest mesh. Still, it is remarkable that we can solve a 3D Laplace problem with a wave of three periods to roundoff accuracy on a twelve-core machine pretty easily - using about 3.5 GB of memory in total for the second to largest case with 24m DoFs, taking not more than eight seconds. The largest case uses 30GB of memory with 191m DoFs.
In the introduction and in-code comments, it was mentioned several times that high orders are treated very efficiently with the FEEvaluation and FEFaceEvaluation evaluators. Now, we want to substantiate these claims by looking at the throughput of the 3D multigrid solver for various polynomial degrees. We collect the times as follows: We first run a solver at problem size close to ten million, indicated in the first four table rows, and record the timings. Then, we normalize the throughput by recording the number of million degrees of freedom solved per second (MDoFs/s) to be able to compare the efficiency of the different degrees, which is computed by dividing the number of degrees of freedom by the solver time.
degree | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Number of DoFs | 2097152 | 7077888 | 16777216 | 32768000 | 7077888 | 11239424 | 16777216 | 23887872 | 32768000 | 43614208 | 7077888 | 8998912 |
Number of iterations | 13 | 12 | 12 | 12 | 13 | 13 | 15 | 15 | 17 | 19 | 18 | 18 |
Solver time [s] | 0.713 | 2.150 | 4.638 | 8.803 | 2.041 | 3.295 | 5.723 | 8.306 | 12.75 | 19.25 | 3.530 | 4.814 |
MDoFs/s | 2.94 | 3.29 | 3.62 | 3.72 | 3.47 | 3.41 | 2.93 | 2.88 | 2.57 | 2.27 | 2.01 | 1.87 |
We clearly see how the efficiency per DoF initially improves until it reaches a maximum for the polynomial degree
Note that the above numbers are a bit pessimistic because they include the time it takes the Chebyshev smoother to compute an eigenvalue estimate, which is around 10 percent of the solver time. If the system is solved several times (as e.g. common in fluid dynamics), this eigenvalue cost is only paid once and faster times become available.
Finally, we take a look at some of the special ingredients presented in this tutorial program, namely the FE_DGQHermite basis in particular and the specification of MatrixFree::DataAccessOnFaces. In the following table, the third row shows the optimized solver above, the fourth row shows the timings with only the MatrixFree::DataAccessOnFaces set to unspecified
rather than the optimal gradients
, and the last one with replacing FE_DGQHermite by the basic FE_DGQ elements where both the MPI exchange are more expensive and the operations done by FEFaceEvaluation::gather_evaluate() and FEFaceEvaluation::integrate_scatter().
degree | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Number of DoFs | 2097152 | 7077888 | 16777216 | 32768000 | 7077888 | 11239424 | 16777216 | 23887872 | 32768000 | 43614208 | 7077888 | 8998912 |
Solver time optimized as in tutorial [s] | 0.713 | 2.150 | 4.638 | 8.803 | 2.041 | 3.295 | 5.723 | 8.306 | 12.75 | 19.25 | 3.530 | 4.814 |
Solver time MatrixFree::DataAccessOnFaces::unspecified [s] | 0.711 | 2.151 | 4.675 | 8.968 | 2.243 | 3.655 | 6.277 | 9.082 | 13.50 | 20.05 | 3.817 | 5.178 |
Solver time FE_DGQ [s] | 0.712 | 2.041 | 5.066 | 9.335 | 2.379 | 3.802 | 6.564 | 9.714 | 14.54 | 22.76 | 4.148 | 5.857 |
The data in the table shows that not using MatrixFree::DataAccessOnFaces increases costs by around 10% for higher polynomial degrees. For lower degrees, the difference is obviously less pronounced because the volume-to-surface ratio is more beneficial and less data needs to be exchanged. The difference is larger when looking at the matrix-vector product only, rather than the full multigrid solver shown here, with around 20% worse timings just because of the MPI communication.
For
As mentioned in the introduction, the fast diagonalization method as realized here is tied to a Cartesian mesh with constant coefficients. When dealing with meshes that contain deformed cells or with variable coefficients, it is common to determine a nearby Cartesian mesh cell as an approximation. This can be done with the class TensorProductMatrixSymmetricSumCollection. Here, one can insert cell matrices similarly to the PreconditionBlockJacobi::initialize() function of this tutorial program. The benefit of the collection class is that cells on which the coefficient of the PDE has the same value can re-use the same Laplacian matrix, which reduces the memory consumption for the inverse matrices. As compared to the algorithm implemented in this tutorial program, one would define the length scales as the distances between opposing faces. For continuous elements, the code project <a href=https://github.com/peterrum/dealii-dd-and-schwarz">Cache-optimized and low-overhead implementations of multigrid smoothers for high-order FEM computations</a> presents the computation for continuous elements. There is currently no infrastructure in deal.II to automatically generate the 1D matrices for discontinuous elements with SIP-DG discretization, as opposed to continuous elements, where we provide TensorProductMatrixCreator::create_laplace_tensor_product_matrix(). Another way of extending the program would be to include support for adaptive meshes. While the classical approach of defining interface operations at edges of different refinement level, as discussed in @ref step_39 "step-39", is one possibility, for Poisson-type problems another option is typically more beneficial. Using the class MGTransferGlobalCoarsening, which is explained in the @ref step_75 "step-75" tutorial program, one can deal with meshes of hanging nodes on all levels. An algorithmic improvement can be obtained by combining the discontinuous function space with the auxiliary continuous finite element space of the same polynomial degree. This idea, introduced by Antonietti et al. @cite antonietti2016uniform in 2016, allows making the multigrid convergence independent of the penalty parameter. As demonstrated by Fehn et al. @cite fehn2020hybrid, this also gives considerably lower iteration counts than a multigrid solver directly working on levels with discontinuous function spaces. The latter work also proposes p-multigrid techniques and combination with algebraic multigrid coarse spaces as a means to efficiently solve Poisson problems with high-order discontinuous Galerkin discretizations on complicated geometries, representing the current state-of-the-art for simple Poisson-type problems. The class MGTransferGlobalCoarsening provides features for each of these three coarsening variants, the discontinuous-continuous auxiliary function concept, p-multigrid, and traditional h-multigrid. The main ingredient is to define an appropriate MGTwoLevelTransfer object and call MGTwoLevelTransfer::reinit_geometric_transfer() or MGTwoLevelTranfer::reinit_polynomial_transfer(), respectively. @anchor PlainProg <a></a> <h1> The plain program</h1> @include "step-59.cc"