A study of vectorization for matrix-free finite element methods

Vectorization is increasingly important to achieve high performance on modern hardware with SIMD instructions. Assembly of matrices and vectors in the finite element method, which is characterized by iterating a local assembly kernel over unstructured meshes, poses challenges to effective vectorization. Maintaining a user-friendly high-level interface with a suitable degree of abstraction while generating efficient, vectorized code for the finite element method is a challenge for numerical software systems and libraries. In this work, we study cross-element vectorization in the finite element framework Firedrake via code transformation and demonstrate the efficacy of such an approach by evaluating a wide range of matrix-free operators spanning different polynomial degrees and discretizations on two recent CPUs using three mainstream compilers. Our experiments show that our approaches for cross-element vectorization achieve 30% of theoretical peak performance for many examples of practical significance, and exceed 50% for cases with high arithmetic intensities, with consistent speed-up over (intra-element) vectorization restricted to the local assembly kernels.


Introduction
The realization of efficient solution procedures for partial differential equations (PDEs) using finite element methods on modern computer systems requires the combination of diverse skills across mathematics, programming languages and high-performance computing. Automated code generation is one of the promising approaches to manage this complexity. It has been increasingly adopted in software systems and libraries. Recent successful examples include FEniCS (Logg et al. 2012), Firedrake (Rathgeber et al. 2016) and FreeFem++ (Hecht 2012). These software packages provide users with high-level interfaces for high productivity while relying on optimizations and transformations in the code generation pipeline to generate efficient low-level code. The challenge, as in all compilers, is to use appropriate abstraction layers that enable optimizations to be applied that achieve high performance on a broad set of programs and machines.
One particular challenge for generating high-performance code on modern hardware is vectorization. Modern CPUs increasingly rely on SIMD instructions to achieve higher throughput and better energy efficiency. Finite element computation requires the assembly of vectors and matrices which represent differential forms on discretized function spaces. This process consists of applying a local function, often called an element kernel, to each mesh entity, and incrementing the global data structure with the local contribution. Typical local assembly kernels suffer from issues that can preclude effective vectorization. These issues include complicated loop structures, poor data access patterns, and short loop trip counts that are not multiples of the vector width. As we show in this paper, general purpose compilers perform poorly in generating efficient, vectorized code for such kernels. Padding and data layout transformations are required to enable the vectorization of the element kernels (Luporini et al. 2015), but the effectiveness of such approaches is not consistent across different examples. Since padding may also result in larger overheads for wider vector architectures, new strategies are needed as vector width increases for the new generation of hardware.
Matrix-free methods avoid building large sparse matrices in applications of the finite element method and thus trade computation for storage. They have become popular for use on modern hardware due to their higher arithmetic intensity (defined as the number of floating-point operations per byte of data transfer). Vectorization is particularly important for computationally intensive high order methods, for which matrix-free methods are often applied. Previous works on improving vectorization of matrix-free operator application, or equivalently, residual evaluation, mostly focus on exposing library interfaces to the users. Kronbichler and Kormann (2017) first perform a change of basis from nodal points to quadrature points, and provide overloaded SIMD types for users to write a quadrature-point-wise expression for residual evaluation. However, since the transformation is done manually, new operators require manual reimplementation. Knepley and Terrel (2013) also transpose to quadrature-point basis but target GPUs instead. Both works vectorize by grouping elements into batches, either to match the SIMD vector length in CPUs or the shared memory capacity on GPUs. In contrast, Müthing et al. (2017) apply an intra-kernel vectorization strategy and exploit the fact that in 3D, evaluating both a scalar field and its three derivatives fills the four lanes of an AVX2 vector register. More recently, Kempf et al. (2018) target high order Discontinuous Galerkin (DG) methods on hexahedral meshes using automated code generation to search for vectorization strategies, while taking advantage of the specific memory layout of the data.
In this work, we present a generic and portable solution based on cross-element vectorization. Our vectorization strategy, implemented in Firedrake, is similar to that of Kronbichler and Kormann (2017) but is fully automated through code generation like that of Kempf et al. (2018). We extend the scope of code generation in Firedrake to incorporate the outer iteration over mesh entities and leverage Loopy (Klöckner 2014), a loop code generator based loosely on the polyhedral model, to systematically apply a sequence of transformations which promote vectorization by grouping mesh entities into batches so that each SIMD lane operates on one entity independently. This automated code generation mechanism enables us to explore the effectiveness of our techniques on operators spanning a wide range of complexity and systematically evaluate our methodology. Compared with an intra-kernel vectorization strategy, this approach is conceptually well-defined, more portable, and produces more predictable performance. Our experimental evaluation demonstrates that the approach consistently achieves a high fraction of hardware peak performance while being fully transparent to end users.
The contributions of this work are as follows: • We present the design of a code transformation pipeline that permits the generation of highperformance, vectorized code on a broad class of FEM models. • We provide a thorough evaluation of our code generation strategy and demonstrate that it achieves a substantial fraction of theoretical peak performance across a broad range of test cases.
The rest of this paper is arranged as follows. After reviewing the preliminaries of code generation for the finite element method in Section 2, we describe our implementation of cross-element vectorization in Firedrake in Section 3. In Section 4, we demonstrate the effectiveness of our approach with experimental results. Finally, we review our contributions and identify future research priorities in Section 5.

Preliminaries
The computation of multilinear forms using the basis functions spanning the discretized function spaces is called finite element assembly. When applying the matrix-free methods, one only needs to assemble linear forms, or residual forms, because matrix-vector products are essentially the assembly of linear forms which represent the actions of bilinear forms. Optimizing linear form assembly is therefore crucial for improving the performance of matrixfree methods. In Firedrake, one can invoke the matrixfree approach without changing the high-level problem formulation by setting solver options as detailed by Kirby and Mitchell (2018).
The general structure of a linear form L is where c i ∈V i , i = 1 . . . k, are arbitrary coefficient functions, and v ∈ V is the test function. L is linear with respect to v, but possibly nonlinear with respect to the coefficient functions.
be the set of basis functions spanning V . Define v i = L(c 1 , . . . , c k ; φ i ) ∈ R, then the assembly of L constitutes the computation of the vector v = (v i , . . . , v n ). In Firedrake, this is treated as a two-step process: local assembly and global assembly.

Local assembly
Local assembly of linear forms is the evaluation of the integrals as defined by the weak form of the differential equation on each entity (cell or facet) of the mesh. In Firedrake, the users define the problem in Unified Form Language (UFL) (Alnaes et al. 2014) which captures the weak form and the function space discretization. Then the Two-Stage Form Compiler (TSFC) (Homolya et al. 2018) takes this high-level, mathematical description and generates efficient C code. As an example, consider the linear form of the weak form of the positive-definite Helmholtz operator: Listing 1. Assembling the linear form of the Helmholtz operator in UFL.
Listing 1 shows the UFL syntax to assemble the linear form L as the vector result, on a 10 × 10 triangulation of a unit square. We choose to use the first-order Lagrange element as our approximation space. Listing 2 shows a C representation of this kernel generated by TSFC. We note the following key features of this element kernel: • The kernel takes three array arguments in this case: coords holds the coordinates of the current triangle, w 0 holds u i , the coefficients of u, and A stores the result. • The first part of the kernel (line 7 to line 15) computes the inverse and the determinant of the Jacobian for the coordinate transformation from the reference element Listing 2. Local assembly kernel the linear form of the Helmholtz operator in C. to the current element. This is required for pulling back the differential forms to the reference element. The Jacobian is constant for each triangle because the coordinate transformation is affine in this case. Otherwise, the Jacobian needs to be computed at each quadrature point.
• The constant arrays t0, t9 are the same for all elements. t0 represents the tabulation of the evaluation of basis functions at quadrature points, t9 represents the quadrature weights. • The ip loop iterates over the quadrature points, evaluating the integrand in (2) and summing to approximate the integral. The j loops iterate over the degrees of freedom, once inside the quadrature loop, and once upon output to the assembled array A. The extents of these loops depend on the integrals performed and the choice of function spaces respectively. • TSFC performs optimization passes on the loop nests.
In particular, it applies loop-invariant code motion which pulls invariant expression out of the loop nests into temporary arrays. This reduces the number of operations required while changing the structure of otherwise perfectly nested loops.

Global assembly
During global assembly, the local contribution from each mesh entity, computed by the element kernel, is accumulated into the global data structure. In Firedrake, PyOP2 (Rathgeber et al. 2012) is responsible for representing and realizing the iteration over mesh entities, marshalling data in and out of the element kernels. The computation is organized as PyOP2 parallel loops, or parloops. A parloop specifies a computational kernel, a set of mesh entities to which the kernel is applied, and all data required for the kernel. The data objects could be directly defined on the mesh entities, or indirectly access through maps from the mesh entities. For instance, the signature for the global assembly of the Helmholtz operator is: parloop(helmholtz, cells, r(cell2vert, RW), coords(cell2vert, R), x(cell2vert, R)).
Here helmholtz is the element kernel as shown in Listing 2, generated by TSFC; cells is the set of all triangles in the mesh; r, coords and x are the global data objects that are needed to create the arguments for the element kernel, where r holds the result vector, coords holds the coordinates of the vertices of the triangles which are needed for computing the Jacobian, and x holds the vector Listing 3. Global assembly code for action of the Helmholtz operator in C. representation of function u (as weights of basis functions). These global data objects correspond to the kernel arguments A, coords and w 0 respectively. The map cell2vert provides indirection from mesh entities to the global data objects, and each data argument is annotated with an access descriptor (R for read-only, RW for read-write access). In this example, all three arguments share the same map because first-order Lagrange element on triangles only have degreesof-freedom defined on the vertices, while the coordinate fields are also defined on the vertices.
Listing 3 shows the C code generated by PyOP2 for the above example. The code is then JIT-compiled when the result is needed in Firedrake. In the context of vectorization, this approach, with the inlined element kernel, forms the baseline in our experimental evaluation. We note the following key features of the global assembly kernel: • The outer loop is over mesh entities.
• For each entity, the computation can be divided into three parts: gathering the input data from global data structures (t3 and t4 in this case, which correspond to kernel arguments coords and w 0), calling the local assembly kernel, scattering the output data (t2) to the global data structure. • The gathering and scattering of data make use of indirect addressing via base pointers (dats) and indices (maps). • Different mesh entities might share the same degrees of freedom. • Global assembly interacts with local assembly via a function call (Line 23). This call can be inlined by the compiler, but it creates an artificial boundary for loop transformations at the source code level. This is the software engineering challenge that limits vectorization to a single local assembly kernel previously.

Vectorization
As one would expect, the loop nests and loop trip counts vary considerably for different integrals, meshes and function spaces that users might choose. This complexity is one of the challenges that our system specifically, and Firedrake more generally, must face in order to deliver predictable performance on modern CPUs, which have increasingly rich SIMD instruction sets.
In the prior approach to vectorization in our framework, the local assembly kernels generated by TSFC are further transformed to facilitate vectorization, as described by Luporini et al. (2015). The arrays are padded so that the trip counts of the innermost loops match multiples of the length of SIMD units. However, padding becomes less effective for low polynomial degrees on wide SIMD units. For instance, AVX512 instructions act on 8 doubleprecision floats, but the loops for degree 1 polynomials on triangles only have trip counts of 3, as shown in Listing 2. Moreover, loop-invariant code motion is very effective in reducing the number of floating-point operations, but hoisted instructions are not easily vectorized as they are no longer in the innermost loops. This effect is more pronounced on tensor-product elements where TSFC is able to apply sum factorization (Homolya et al. 2017) to achieve better algorithmic complexity.

Cross-element vectorization and Loopy
Another strategy is to vectorize across several elements in the outer loop over the mesh entities, as proposed previously by Kronbichler and Kormann (2017). This approach computes the contributions from several mesh entities using SIMD instructions, where each SIMD lane handles one entity. This is always possible regardless of the complexity of the local element kernel because the computation on each entity is independent and identical. One potential downside is the increase in memory pressure as the working set is larger.
For a compiler, the difficulty in performing cross-element vectorization (or, more generally, outer-loop vectorization) is to automate a sequence of loop transformations and necessary data layout transformations robustly. This is further complicated by the indirect memory access in data gathering and scattering, and the need to unroll and interchange loops across the indirections, which requires significantly more semantic knowledge than what is available to the C compiler.
Loopy (Klöckner 2014) is a loop generator embedded in Python which targets both CPUs and GPUs. Loopy provides abstractions based on integer sets for loop-based computations and enables powerful transformations based on the polyhedral model (Verdoolaege 2010). Loop-based computations in Loopy are represented as Loopy kernels. A Loopy kernel is a subprogram consisting of a loop domain and a partially-ordered list of scalar assignments acting on multi-dimensional arrays. The loop domain is specified as the set of integral points in the convex intersection of quasi-affine constraints, as described by the Integer Set Library (Verdoolaege 2010).
To integrate with Loopy, the code generation mechanisms in Firedrake were modified as illustrated in Figure 1. Instead of generating source code directly, TSFC and PyOP2 are modified to generate Loopy kernels. We have augmented the Loopy internal representation with the ability to support a generalized notion of kernel fusion through the nested composition of kernels, specifically through subprograms and inlining. This allows PyOP2 to inline the element kernel such that the global assembly Loopy kernel encapsulates the complete computation of global assembly. This holistic view of the overall computation enables robust loop transformations for vectorization across the boundary between global and local assembly.
Listing 4 shows an abridged version of the global assembly Loopy kernel for the Helmholtz operator, with the element kernel fused. We highlight the following key features of Loopy kernels: • Loop indices, such as n, i1, are called inames in Loopy, which define the iteration space. The bounds of the loops are specified by the affine constraints in domains. • Loop transformations operate on kernels by rewriting the loop domain and the statements making up the kernel. In addition, each iname carries a set of tags governing its realization in generated code, perhaps as a sequential loop, as a vector lane index, or through unrolling. • Multi-dimensional arrays occur as arguments and temporaries. The memory layout of the data can be specified by assigning tags to the array dimensions. • Dependencies between statements specify their partial order. Statement scheduling can also be controlled by assigning priorities to statements and inames.
For example, to achieve cross-element vectorization (by batching 4 elements into one SIMD vector in this example) we invoke the following sequence of Loopy transformations on the global assembly Loopy kernel, exploiting the domain knowledge of finite element assembly: • Split the outer loop n over mesh entities into n outer and n simd, with n simd having trip count of 4. The objective is to generate SIMD instructions for the n simd loops, such that each vector lane computes one iteration of the n simd loops. • Assign the tag SIMD to the new iname n simd.
This tag informs Loopy to force the n simd loop to be innermost, privatizing data by vector-expansion if necessary.
We highlight the change to the Loopy kernel after these transformations in Listing 5. Loopy supports code generation for different environments from the same kernel by choosing different targets. We introduced an OpenMP Target to Loopy which extends its existing C-language Target to support OpenMP pragmas, facilitating SIMD instruction generation.
Listing 6 shows the generated C code for the Helmholtz operator vectorized by grouping together 4 elements. Apart from the previously mentioned changes, we note the following details: • The n simd loops are pushed to the innermost level. Moreover, this transformation vector-expands temporary arrays such as t2, t3, t4 by 4, with the expanded dimension labeled as varying the fastest when viewed from (linear) system memory. This ensures their accesses in the n simd loops always have unit stride. • Loopy provides a mechanism to declare arrays to be aligned to specified memory boundaries (64 bytes in this example). • The n simd loops are decorated by pragma omp simd to inform C compilers to generate SIMD instructions. The exception is the writing back to the global array (Line 36), which is sequentialized due to potential race conditions, as different mesh entities could share the same degrees of freedom. • The remainder loop which handles the cases where the number of elements is non-divisible by 4 is omitted here for simplicity. After cross-element vectorization, all local assembly instructions (Lines 24-36) are inside the n simd loops, which always have trip counts of 4 and are stride 1. All loop-varying array accesses are stride 1 in the fastest moving dimension. There are no loop-carried dependencies in n simd loops. As a result, the n simd loops, and therefore all local assembly instructions, are vectorizable without further consideration of dependencies. This is verified by checking the x86 assembly code and running the program with the Intel Software Development Emulator.

Vector extensions
A more direct way to inform the compiler to emit SIMD instructions without depending on OpenMP implementation is to use vector extensions 2 , which support vector data types. These were first introduced in the GNU compiler (GCC), but  are also supported in recent versions of the Intel C compiler (ICC) and Clang. Analogous mechanisms exist in various vector-type libraries, e.g. VCL (Fog 2017). To evaluate and compare with the directive-based approach from Section 3.1, we created a new code generation target in Loopy to support vector data types. When inames and corresponding array axes are jointly tagged as vector loops, Loopy generates code to compute on data in vector registers directly, instead of scalar loops over the vector lanes. It is worth noting that the initial intermediate representation of the loop was identical in each case, and that the different specializations were achieved through code transformation.  Listing 7 shows the C code generated for the Helmholtz operator vectorized by batching 4 elements using the vector extension target. Here all vectorized (innermost) loops for local assembly are replaced by operations on vector variables. For instructions which do not fit the vector computation model, most noticeably the indirect data gathering (Line 18), or instructions containing builtin mathematics functions which are not supported on vector data types (Line 32), Loopy defaults to generating scalar loops over vector lanes, decorated with pragma Listing 6. Global assembly code for action of the Helmholtz operator in C vectorized by batching 4 elements.

Performance Evaluation
We follow the performance evaluation methodology of Luporini et al. (2017) by measuring the assembly time of a range of operators of increasing complexity and polynomial degrees. Due to the large number of combinations of experimental parameters (operators, meshes, polynomial degrees, vectorization strategies, compilers, hyperthreading), we only report an illustrative portion of the results here, with the entire suite of experiments made available on the interactive online repository CodeOcean (Sun 2019a).

Experimental setup
We performed experiments on a single node of two Intel systems, based on the Haswell and Skylake microarchitectures, as detailed in Table 1. Because we observe that hyperthreading usually improves the performance by 5% to 10% for our applications, we set the number of MPI processes to the number of logical cores of the CPU to utilize all available computation resources. Experimental results with hyperthreading turned off are available on CodeOcean.
The batch size, i.e., the number of elements grouped together for vectorization, is chosen to be consistent with the SIMD length. We use three C compilers: GCC 7.3, ICC 18.0 and Clang 5.0. The two vectorization strategies described in Section 3 are tested on all platforms. We use the listed Base Frequency to calculate the peak performance in Table 1. In reality, modern Intel CPUs dynamically reduce frequencies on heavy workloads with AVX2 and AVX512 instructions, which results in lower achievable performance. Running the optimized LINPACK benchmark binary provided by Intel gives a reasonable indication of peak performance for real applications.
For the benefit of reproducibility, we have archived the specific versions of Firedrake components used for experimental evaluation on Zenodo (Zenodo/Firedrake 2019). An installation of Firedrake with components matching the ones used for evaluation in this paper can be obtained following the instruction at https://www.firedrakeproject.org/download.html, with the following command: The evaluation framework is archived at (Sun 2019b). We measure the execution time of assembling the residual of five operators: the mass matrix ("mass"), the Helmholtz equation ("helmholtz"), the vector Laplacian ("laplacian"), an elastic model ("elasticity"), Listing 7. Global assembly code for action of the Helmholtz operator in C vectorized by 4 elements (using vector extensions).
We performed experiments on both 2D and 3D domains, with two types of mesh used for each case: triangles ("tri") and quadrilaterals ("quad") for 2D problems, tetrahedra ("tet") and hexahedra ("hex") for 3D problems. The arithmetic intensity of the operators are listed in Table 2. The memory footprint is calculated assuming perfect cachingit is thus a lower bound which results in an upper bound estimation for the arithmetic intensity. The triangular and tetrahedral meshes use an affine coordinate transformation (requiring only one Jacobian evaluation per element). The quadrilateral and hexahedral meshes use a bilinear (trilinear) coordinate transformation (requiring Jacobian evaluation at every quadrature point), which usually results in higher arithmetic intensities at low orders. In Firedrake, tensor-product elements (McRae et al. 2016) benefit from Table 2. Operator characteristics and speed-up summary, using GCC with vector extensions. AI: arithmetic intensity (FLOP/byte). D: trip count of loops over degrees of freedom. Q: trip count of loops over quadrature points. H: speed-up over baseline on Haswell, 16 processes, with vector extensions. S: speed-up over baseline on Skylake, 40 processes, with vector extensions. optimizations such as sum factorization to achieve lower asymptotic algorithmic complexity. They are therefore more competitive for higher order methods (Homolya et al. 2017).
We record the maximum execution time of the generated global assembly kernels on all MPI processes. This time does not includes the time in synchronization and MPI data exchange for halo updates. Each experiment is run five times, and the average execution time is reported. Exclusive access to the compute nodes is ensured and threads are pinned to individual logical cores. Startup costs such as code generation time and compilation time are excluded. We use automatic vectorization by GCC without batching, compiled with the same optimization flags listed earlier, as the baseline for comparison. Comparing with the crosselement strategy, the baseline represents the out-of-the-box performance of compiler auto-vectorization for the local element kernel. We note that cross-element vectorization does not alter the algorithm of local assembly except for the vector expansion, as illustrated by Listing 2 and Listing 6. Consequently, the total number of floating-point operations remains the same. The performance benefit from crosselement vectorization is therefore composable with the operation-reduction optimizations performed by the form compiler to the local assembly kernels.

Experimental results and discussion
Figures 2 to 5 show the performance of the helmholtz and elasticity operators on Haswell and Skylake, vectorized with OpenMP pragma as described in Section 3.1, and with vector extensions as described in Section 3.2. We indicate the fraction of peak performance achieved on the left axis, and the fraction of the LINPACK benchmark performance on the right axis. Figure 6 and 7 compare the roofline models (Williams et al. 2009) of the baseline and our cross-element vectorization implementation using GCC and vector extensions on Haswell and Skylake. The speed-up achieved is also summarized in Table 2. We analyze the data in the following aspects:

Compiler comparison and vector extensions
When vectorizing with OpenMP pragma, ICC gives the best performance for almost all test cases, followed by Clang, while GCC is significantly less competitive. The performance disparity is more pronounced on Skylake than on Haswell. However, when using vector extensions, Clang and GCC improve significantly and are able to match the performance of ICC on both Haswell and Skylake, whereas ICC performs similarly with OpenMP pragma and with vector extensions.
We use the Intel Software Development Emulator 7 to count the number of instructions executed at runtime for code generated by different compilers. The data indicate that although floating-point operations are fully vectorized by all compilers, GCC and Clang generate more load and store instructions between vector registers and memory when using OpenMP pragma for vectorization. One possible reason is that GCC and Clang choose to allocate short arrays to the stack rather than the vector registers directly, causing more load on the memory subsystem.
In light of these results, we conclude that vectorization with vector extensions allows greater performance portability on different compilers and CPUs for our application. It is, therefore, our preferred strategy for implementing crosselement vectorization, and is the default option for the rest of our analysis. set of data. On simple operators such as mass on tri and tetra, the kernels have simple loop structures and the compilers can sometimes successfully apply other optimizations such as unrolling and loop interchange to achieve vectorization without batching elements in the outer loop. The pattern of speed-up is consistent across Haswell and Skylake. Higher speed-up is generally achieved on more complicated operators (e.g. hyperelasticity), and on tensor-product elements (quad and hex), which generally correspond to more complicated loop structure and higher arithmetic intensity due to the Jacobian recomputation at each quadrature point.

Achieved fraction of peak performance
We observe that the fraction of peak performance varies smoothly with polynomial degrees for cross-element vectorization in all test cases. This fulfils an important design requirement for Firedrake: small changes in problem setup by the users should not create unexpected performance degradation. This is also shown in Figures 6 and 7 where the results Haswell cross-element vectorization mass -tri mass -quad mass -tet mass -hex helmholtz -tri helmholtz -quad helmholtz -tet helmholtz -hex laplacian -tri laplacian -quad laplacian -tet laplacian -hex elasticity -tri elasticity -quad elasticity -tet elasticity -hex hyperelasticity -tri hyperelasticity -quad hyperelasticity -tet hyperelasticity -hex  Skylake cross-element vectorization mass -tri mass -quad mass -tet mass -hex helmholtz -tri helmholtz -quad helmholtz -tet helmholtz -hex laplacian -tri laplacian -quad laplacian -tet laplacian -hex elasticity -tri elasticity -quad elasticity -tet elasticity -hex hyperelasticity -tri hyperelasticity -quad hyperelasticity -tet hyperelasticity -hex are more clustered on the roofline plots after crosselement vectorization. The baseline shows performance inconsistency, especially on low polynomial degrees. For instance, for the helmholtz operator with degree 3 on quad, the quadrature loops and the basis function loops all have trip counts of 4, which fits the vector length on Haswell and results in better performance.
On simplicial meshes (tri and tetra), higher order discretization leads to kernels with very high arithmetic intensity because of the quadratic and cubic increases in the number of basis functions, and thus the loop trip counts. This is due to the current limitation that simplicial elements in Firedrake are not sum factorized. In these test cases, we observe that the baseline approaches cross-element vectorization for sufficiently high polynomial degrees. This is not a serious concern for our optimization approach because the break-even degrees are very high except for simple operators such as mass, and ultimately tensorproduct elements are more competitive for higher order methods in terms of algorithmic complexity.
We also observe that there exists a small number of test cases where the achieved peak performance is marginally higher than the LINPACK benchmark on Skylake, as shown in Figure 7. One possible reason for this observation is thermal throttling since our test cases typically run for a shorter period of time than LINPACK. We also note that these test cases correspond to high order hyperelasticity operator on tet which are not practically important use cases, since using tensor-product elements requires much less floating-point operations at the same polynomial order.

Tensor-product elements
We observe higher and more consistent speed-up for tensor-product elements (quad and hex) on both Haswell and Skylake. This is because, on these meshes, more computation can be moved outside the innermost loop due to sum factorization, which results in more challenging loop nests for the baseline strategy which attempts to vectorize within the element kernel. The same applies to the evaluation of the Jacobian of coordinate transformation, which is a nested loop over quadrature points after sum factorization for tensor-product elements.
The base elements of quad and hex are interval elements in 1D, thus the extents of loops over degrees of freedom increase only linearly with respect to polynomial degrees, as shown in Table 2. As a result, the baseline performance does not improve as quickly for higher polynomial degrees on quad and hex compared with tri and tet, resulting in stable speed-up for cross-element vectorization observed on tensor-product elements.

Conclusion and future work
We have presented a portable, general-purpose solution for delivering stable vectorization performance on modern CPUs for matrix-free finite element assembly for a very broad class of finite element operators on a large range of elements and polynomial degrees. We described the implementation of cross-element vectorization in Firedrake which is transparent to the end users. Although the technique of cross-element vectorization is conceptually simple and has been applied in hand-written kernels before, our implementation based on code generation is automatic, robust and composable with other optimization passes.
The write-back to global data structure is not vectorized in our approach due to possible race conditions. The newly introduced Conflict Detection instructions in the Intel AVX512 instruction set could potentially mitigate this limitation (Zhang 2016, Section 2.3). This could be achieved by informing Loopy to use the relevant intrinsics when generating code for loops with specific tags.
We have focused on the matrix-free finite element method because it is compute-intensive and more likely to benefit from vectorization. However, our methods and implementation also support matrix assembly. Firedrake relies on PETSc (Balay et al. 2017) to handle distributed sparse matrices, and PETSc requires certain data layouts for the input array when updating the global matrices. When several elements are batched together for crosselement vectorization, we need to generate code to explicitly unpack/transpose the local assembly results into individual arrays before calling PETSc functions to update the global sparse matrices for each element. Future improvement could include eliminating this overhead, possibly by extending the PETSc API.
The newly introduced abstraction layer, together with Loopy integration in the code generation and optimization pipeline, opens up multiple possibilities for future research in Firedrake. These include code generation with intrinsics instructions, loop tiling, and GPU acceleration, all of which are already supported in Loopy.
of Naval Research under grant number N00014-14-1-0117 and the US National Science Foundation under grant number CCF-1524433. AK gratefully acknowledges a hardware gift from Nvidia Corporation.

Supplemental material
Here we describe the operators used as the test cases for experimental evaluation. They are defined as bilinear forms, and we take their action in UFL to obtain the corresponding linear forms.
mass Here u and v are scalar-valued trial and test functions.
where I is the identity matrix, λ and µ are the Lamé parameters of the material, F is the deformation gradient, C is the right Cauchy-Green tensor, E is the Euler-Lagrange strain tensor. We define the Piola-Kirchhoff stress tensors as: Finally, we arrive at the residual form of this nonlinear problem: where b is the external forcing. To solve this nonlinear problem, we need to linearize the residual form at an approximate solution u, this gives us the bilinear form a: a = lim →0 r(u + δu) − r(u) , where the trial function is δu, the test function is v, and u is a coefficient of the operator. We use the automatic differentiation of UFL to compute the operator symbolically.