A sparse regression approach to modeling the relation between galaxy stellar masses and their host halos

Sparse regression algorithms have been proposed as the appropriate framework to model the governing equations of a system from data, without needing prior knowledge of the underlying physics. In this work, we use sparse regression to build an accurate and explainable model of the stellar mass of central galaxies given properties of their host dark matter (DM) halo. Our data set comprises 9,521 central galaxies from the EAGLE hydrodynamic simulation. By matching the host halos to a DM-only simulation, we collect the halo mass and specific angular momentum at present time and for their main progenitors in 10 redshift bins from $z=0$ to $z=4$. The principal component of our governing equation is a third-order polynomial of the host halo mass, which models the stellar-mass halo-mass relation. The scatter about this relation is driven by the halo mass evolution and is captured by second and third-order correlations of the halo mass evolution with the present halo mass. An advantage of sparse regression approaches is that unnecessary terms are removed. Although we include information on halo specific angular momentum, these parameters are discarded by our methodology. This suggests that halo angular momentum has little connection to galaxy formation efficiency. Our model has a root mean square error (RMSE) of $0.167 \log_{10}(M^*/M_\odot)$, and accurately reproduces both the stellar mass function and central galaxy correlation function of EAGLE. The methodology appears to be an encouraging approach for populating the halos of DM-only simulations with galaxies, and we discuss the next steps that are required.


INTRODUCTION
Gravitational collapse in the expanding Universe leads to the formation of complex, highly non-linear structures. The force of gravity can be accurately modeled through N-body cosmological simulations. However, observational probes of the Universe's structure usually rely on galaxies, bringing into play a much broader range of baryon physics. Unlike dark matter only (DMO) simulations, which only allow interactions through gravity, baryon simulations need to deal with complicated feedback processes and are strongly influenced by events happening at scales much smaller than the size of the simulation grid (Schaye et al. 2010). While this can be mitigated by including sub-grid models of these processes, in the form of sources or sinks of energy and matter, the resulting computational ★ E-mail: miguel.a.de-icaza-lizaola@durham.ac.uk cost of accurate baryonic simulations remains far greater than that of DMO simulations. As a consequence the volume of the universe that can be modelled in this way is limited. A hybrid approach is therefore necessary, in which a large volume DMO simulation is populated with galaxies based on the halo-galaxy relationship found in smaller baryonic simulations. This requires a methodology that can extract robust halo-galaxy relationships making optimal use of the available volume of baryonic simulations. In this paper, we explore whether sparse-regression models, which are a type of machine learning algorithm, provide an attractive approach.
We leave the extension of our methodology to include satellite galaxies for a future work.
It is already well established that there is a strong correlation between the stellar mass ( * ) of a central galaxy and the mass of its host halo (White & Rees 1978). This relation is known as the Stellar Mass -Halo Mass (SMHM) relation. However, there is a significant scatter in the SMHM relation (e.g. More et al. 2010;Zu & Mandelbaum 2015) which indicates that the stellar mass of a galaxy may also depend on other factors. Here, we investigate whether the formation history and the angular momentum of the host dark matter halo also play a role. Dependence on formation history is often referred to as assembly bias (Sheth & Tormen 2004;Gao et al. 2005;Gao & White 2007;Ramakrishnan et al. 2019). The effect of assembly bias in the EAGLE simulation has been studied in Chaves-Montero et al. (2016), where it was concluded that it might alter the clustering signal amplitude of the sample by up to 20%. It is worth noting, however, that while assembly bias has been detected in several simulations, the efforts made to detect it on observations have been inconclusive to date (e.g. Lin et al. 2016;Tojeiro et al. 2017;Salcedo et al. 2020).
To explore the effect of assembly bias, Matthee et al. (2016) studied the correlation between the residuals in the SMHM relation and different DM properties on EAGLE. They found that the parameters that are most correlated with this residual are those that are determined by the evolution of the halo mass, in particular concentration and halo formation time. They found no other parameter strongly correlated with the residual of the SMHM relation once it was corrected for the halo concentration correlation. Our aim in this paper is to investigate the optimal measure of halo formation trajectory and to determine whether the prediction of stellar mass can be improved by including the additional halo specific angular momentum.
In observations, the angular momentum of a galaxy appears correlated with its stellar mass (Fall & Romanowsky 2013). However, while there is a correlation between the history of the specific angular momentum of a galaxy and its host halo (Zavala et al. 2016), Danovich et al. (2015) uses cosmological simulations to suggest that the specific angular momentum of gas and dark matter undergo decoupled formation histories and that it is only the final distribution of spin parameters that is similar between baryons and dark matter. Nevertheless, it remains physically plausible that halo specific angular momentum and galaxy formation efficiency maybe interconnected in some more complex way.
The aim of this work is to develop a sparse regression approach to find a polynomial equation that relates the stellar mass of a galaxy with the properties of its DM halo. This is a form of Machine Learning (ML). More conventional ML algorithms such as neural networks (e.g. Lecun et al. 2015) and random forests (e.g. Breiman 2001) are some of the most powerful tools for parameterising a data set. However, algorithms like neural networks or ensemble models (Roberts & Everson 2001) generate models with virtually no explainability, so that extracting the physics behind the model would be difficult. While the network could predict galaxy properties, it would be hard to gain confidence that the output is physically reasonable. Random forest algorithms work by building a collection of decision trees that are designed to be as uncorrelated as possible. These models are easier to interpret, as one can measure how often a variable was used and how drastically the entropy decreases in each step. A potential issue with some machine learning algorithms is the slow evaluation speed of a final model.
Sparse regression methods (SRM; Hastie et al. 2015) are a set of minimization algorithms that are efficient at discarding unneces-sary free parameters. This makes them very useful at minimizing functions for which one suspects that most free parameters are irrelevant except for a small subset that one is trying to identify. SRM provide a trade-off between including very many free parameters (which would result in over-fitting to random artefacts in the data) and eliminating too many parameters (which would result in a poor description of the data). SRM have been proposed as the appropriate framework to extract the governing equations of a physical system from the data alone with relatively little prior knowledge required of the system's physics (Brunton et al. 2016). A key advantage of the SRM approach is that the small number of retained coefficients are likely to have a clearer physical interpretation. Further more, given that the models produced with SRM can be simple polynomial equations, their evaluation comes with virtually no computational cost.
We apply the sparse regression methodology to model the stellar mass of galaxies in the EAGLE 100 Mpc hydrodynamical simulation (Schaye et al. 2015;Crain et al. 2015). The EAGLE simulation provides a reasonable description of the observed Universe (Crain et al. 2015;Artale et al. 2017;Furlong et al. 2015;Trayford et al. 2016), and is ideal for our proposes as parameter values of the DM halos are stored at several redshift slices (McAlpine et al. 2016), which permit us to model assembly bias. While all models presented here were calibrated on EAGLE data, we expect that the methodology can be applied to other hydrodynamical simulations with similar success. One common issue with this type of analysis is the danger of including a selection bias in the independent variables due to dark matter halos in hydrodynamical simulations being affected by baryonic processes that might alter properties like their density profile (e.g. Schaller et al. 2015c;Martizzi et al. 2012;Navarro et al. 1996). With this in mind, we use a one-to-one matching (Schaller et al. 2015a) between our hydrodynamical simulation and a dark matter-only simulation built using the same properties and initial conditions.
Our work builds on other ML methods that have shown promising developments in the creation of mock catalogs using DM halos. Moster et al. (2021) uses neural networks to populate DM halos from N-body simulations with galaxies. While their goal is similar to ours, the philosophy behind both models is different. Their approach avoids using hydrodynamical simulations and focuses on placing the galaxies inside halos in such a way that it reproduces observed properties of the galaxy populations. While that approach leads to accurate models, it would by construction be hard to extract any physical interpretation out of it. Lucie-Smith et al. (2018) used a random forest algorithm to predict which DM particle in a simulation would end up inside a DM halo of a given mass, while Berger & Stein (2019) used a neural network to build DM halo mocks.
In this paper, we focus on the properties of central galaxies. For mock simulations to be compared to observations of large-scale structure surveys they need to be populated with galaxies in such a way that they reproduce the stellar mass function (SMF) and the clustering patterns of galaxies. This would require us to assign both a central and a population of satellite galaxies to each dark matter halo. We discuss the additional challenges of modelling the stellar mass of satellite galaxies at the end of the paper.
This paper is organized as follows. Section 2 introduces the sparse regression methodology used to build our model and includes an example model that illustrates the behavior of the algorithm. Section 3 introduces the hydrodynamical simulation from which the input data were extracted and discusses how the data was processed to be used by our algorithm. The details on running the algorithm using the data set are presented in Section 4. As sections 2 and 4 introduce and test our methodology, readers primarily interested in the astrophysical results can go directly to section 5. Section 5 shows the results of the different configurations in which we run our code, and we discuss the physical interpretation of different terms of our governing equations and compare the stellar mass distribution and clustering statistics to those from EAGLE. Our conclusions and final thoughts, along with a brief discussion on the next steps that we aim to take, are presented in Section 6.

THE SPARSE REGRESSION METHODOLOGY
This section starts by setting the general problem in §2.1. This is followed, in §2.2, by an introduction to the sparse regression method considered, i.e. the Least Absolute Shrinkage and Selection Operator (LASSO). We explain our minimization implementation in §2.3 and the penalty hyperparameter definition in §10. We end in §2.5 with a simple example to more clearly illustrate our methodology.

Problem statement
We are interested in finding a function that models a physical property, , that might be determined by a set of M variables ì = [ 1 , ...., ] (we reserve the symbol for normalised variables -see below). In this work is the stellar mass of a galaxy and ì a set of present and past properties of its DM halo. We can build a data set of values of and their associated ì by looking at large catalogues where the value of both has been measured. In this paper we use the output of the EAGLE hydrodynamical simulations (see Section 3).
We collect a sample of N galaxies to build a vector ì , where and an associated matrix X , with each row ì (1 ≤ ≤ ) representing the list of dependent variables associated with the DM halo of the corresponding galaxy : . .
The different columns of matrix X correspond to different properties of the DM halo, where each property can have different units and distributions. It is, therefore, necessary to standardize our data. We choose to do this using the mean and standard deviation of the distribution, and define the normalised variable as: where ì is now a column of X and 1 ≤ ≤ and and are the mean and standard deviation operators. The same normalization scheme is also applied to our dependent variable: = ( − ( ))/ ( ). Note that the primed variables refer to natural quantities and non-primed variables to standardised ones that have zero mean and unit variance.
The observed values of the M variables of ì will be used as inputs for a series of functions whose output one hopes to use to predict . These functions can in principle have any desired form, and so we will use a gradient descent algorithm to fit a linear combination of them to ì (Section 2.3). Although other approaches like singular value decomposition Golub (1970) could be used in the hyperbolic case, we wish to ensure that the method is generic.
We consider a library of functions, and their evaluated values for the observed parameters of the ℎ galaxy ì ( ì ) = [ 1 (ì ), ....., (ì )]. The library of functions that we use in this work consists of: The total number of functions considered is: The number M of DM halo properties that we use depends on the specific parametrisation of the present and past properties of the halo that we select. We consider four different models each with different values of M (Section 3.5). This methodology is able to deal with far more complicated functional forms than the polynomial forms used here. For example, we experimented with exponential decays and step functions. However including these more complicated functions in our initial testing did not improve our models, but increased the computational time so we excluded them from our final fits in this paper.
Our goal is to find optimized values of the coefficients ì = [ 1 , ......., ] that will make the linear combinations of our functions a sufficiently accurate model of ì. Specifically, we aim to find the optimized values of ì such that ì ( ì , X) ≈ ì, where ì ( ì , X) is defined as: We discuss the precise meaning of the approximate equality in the following section. Our aim is to achieve a balance between the accuracy of fitting the data while keeping the model as simple as possible. Clearly there is an underlying assumption that the functions included can be linearly combined into a sufficiently accurate model. In the absence of a detailed understanding of the physical system our approach is to include a large number of functions in our library, spanning the possible range of physical interactions.

Sparse regression
Sparse regression methods aim to minimise the error term | ì ( ì , X) − ì| while discarding any unnecessary functions by setting their associated coefficients to a negligibly small value. This makes them the appropriate framework for our problem as it allows us to include a large number of functions while knowing that all of the unnecessary ones will be discarded by the methodology. The fewer surviving coefficients the easier it is to interpret the solution (i.e. the more explainable it is). The solution will also be less susceptible to over-fitting to random fluctuations in the training data.
(6) ( ì ) is known as the penalty function and its value should increase with the absolute value and number of coefficients that are not set to zero. The coefficient is a hyperparameter of the model and determines the relative magnitude of the penalty term. The value of is determined using a k-fold methodology (Hastie et al. 2015), as described in Section 2.4. 2 is the normal chi-squared function defined as where is an estimate of the uncertainty in measurement and ( ì , X) is the ℎ element of ì ( ì , X). In the absence of the penalty term, would be the negative of the logarithmic likelihood function (ie., = −2 ln L).
In the standard LASSO approach ( ì ) is defined as We introduce a regularisation term to smooth out the gradient discontinuities that occur when parameters are close to zero, where is a small constant. Note that exp(− ( / ) 2 ) is very close to zero when | | and close to one when | | . Therefore determines how close to zero a coefficient needs to go before its contribution to the penalty is negligible. We adopt a value of = 10 −3 , which we show in Section 4 makes unnecessary coefficients go close enough to zero to be clearly distinguished from the ones that are useful, while keeping a reasonable computational cost (the closer to zero unnecessary coefficients are required to get the longer the minimizer needs to run). We define a cutoff value as the threshold between used and discarded parameters: every coefficient larger than will be used in our model and all smaller coefficients are discarded. The exact value of is presented in Section 4.
In equation 9 the contribution of each coefficient is independent of the contribution of all other coefficients. This means that there is not a strong penalty for having many small, but larger than , values of . We found that a more efficient approach at eliminating non-essential coefficients is to consider the contribution of a coefficient, in comparison to all of the other surviving coefficients. This is achieved by the following penalty function Combining both modifications our penalty function has the following form This is the form of the penalty function adopted in our algorithm. The 2 is a measure of the goodness of fit, which decreases as the model becomes more accurate. Balancing of the goodness of fit statistic and penalty term makes sparse models robust against overfitting: an over-fitted model would use many parameters to make a unrealistically good fit, which would make the 2 very small but it would also make the penalty term large (as it grows with the number of parameters). The minimum should belong to a model that is as simple as possible, while still being a sufficiently good fit. This is why when using a large library of functions all but a small subset of the coefficients end up being set to zero.
By making some general assumptions, we can estimate that in the optimised solution ( ì ) = O (1). First we note that ( ì ) ≈ =1 ≠ | || | ≈ ( =1 | |) 2 , and that the optimised solution should satisfy ì ( ì , X) = F(X) ì ≈ ì Secondly let us note that, in our case, (X) correspond to third order combinations of elements of ì , with each element standardised to be of the order of magnitude of the elements of ì and therefore (X) ≈ O ( ). From here it should be that =1 | |≈ O (1), and consequently that ( ì ) ≈ O (1). The properties of the simulated galaxies do not have formal measurement errors, but we still expect a random scatter due to the stochastic nature of the formation process. We therefore estimate a constant 2 = 2 (for 1 ≤ ≤ ) using evaluated at ì that minimizes equation 7 when 2 = . A consequence of using this definition of 2 is that if we then minimise Eq. 6 with no penalty ( = 0) we find The optimised value of should be such that 2 and ( ì ) are of comparable size. Given that by ( ì ) ≈ O (1), and that we constructed 2 such that 2 ≈ O ( ) then ≈ O ( ). This allows us to estimate the sizes of penalty that we should explore.

Minimization
We use a gradient decent algorithm to minimise Eq. 6. The process starts at an initial point in parameter space and iteratively walks in the direction opposite to the gradient of with respect to . We use a variation of Arfken (1985), the standard practice for most machine learning methodologies. The size of each step is determined by a parameter . At every step one computes , if it is larger at the new position then is reduced (as it would likely mean that it overshot the minimum). In the opposite case size is increased if is smaller at the new position as it is likely that we are still far from the minimum.
The gradient of from Eq. 6 is computed with respect to the vector of coefficients . In the standard methodology one makes a step in the direction of the gradient at the current position. However, we found that this did not perform well in the steep-sided valleys that characterise . In such valleys, a step will overshoot the minimum, and as a consequence the next step would be in the opposite direction than the previous one but with a slightly smaller step size. Progress along the valley toward the global minimum is then slow. This makes convergence inefficient in high dimensional spaces, as the minimizer tends to jump from one wall of a potential well to the opposite wall at each step instead of following a more direct downwards path.
We achieved performance gains by using the following adaptation of the algorithm for determining the next step of the minimization. Defining the position of the ℎ step as = 1 , ..., , the gradient vector points downhill towards the nearest local minimum. Since we are only interested in the direction of the gradient and not its magnitude we can normalize the vector as We make a first trial step on the downhill direction that takes us to the following position in parameter space The direction of the next step +1 is given by the mean of the gradients at and +1/2 , This swings the direction of travel to align with the valley. In order to determine if our code has converged we look at the size of steps taken by the minimizer. A very small step size indicates that we have not moved far for several steps. Our code will run until the step size becomes smaller than some tolerance value. A smaller value of the tolerance means we get closer to the minimum, however, the computational cost of our minimization is strongly dependent on this tolerance value. We found that a tolerance of the step size of < 10 −6 produces stable results and manageable low computational cost.

Penalty Hyperparameter
We will use a k-fold methodology to fit the hyperparameter . Kfold is a well-known method that is standard practice for fitting hyperparameters in sparse regression (e.g. Hastie et al. 2015). The method works by randomly dividing data into independent subsets of roughly the same size. Then the hyperparameter is sampled in independent runs, each time one subset is left out of the minimization and is used to test the model on data it has not seen before. The set left out is called the test set. The rest of the data points are used for running the minimisation algorithm and are referred as the training set. In this work we will use a value of = 10, which is standard practice. Each run explores with thirty uniformly spread points in log 10 between = 1 and = , to which the case of = 0 is added. The higher the granularity of that we explore the more computationally expensive our code becomes. We found by testing that 30 uniformly spread out points in log 10 was enough to find sufficiently smooth curves without a high computational cost. In principle, one could explore larger values of . However in our case models with = provided already significantly worse fits than models with smaller values, which indicates that the penalty was already too large at = . This is true for all models presented in this paper except for the example of Section 2.5 where we needed to run between = 0 and = 800.
In order to quantify the quality of fit for a given ì , we will use the root mean square error (RMSE) defined as: When is close to zero, the error in the model of the training set would be small as there is no significant penalty and the model is overfitted. Such a model is poor at predicting results in data that it has never seen before and this translates into a large error on the test set (see Figs. 2 and 5). For very large values of the penalty, the model becomes too simple as coefficients are heavily penalised: a model that is too simple will show large error on both the test and the training sets. When is large enough to avoid overfitting but not too large that models become too simple, the RMSE of the test set will reach its minimum (as illustrated in Fig. 2). If is the value of where the minimum is for a given k-fold, then = ( ) is an estimate of the optimised value of , where is the mean operator.
It is common practice in sparse regression to choose a value that is larger than by one standard deviation, this is the onestandard-error rule from Hastie et al. (2015). This is done to avoid over fitting due to inaccuracies in the methodology. In this work we implement a modified version of the one-standard-error rule. Let us define RMSE ( ) as the RMSE of the ℎ k-fold as function of . By construction, RMSE ( ) is minimised for = . If (RMSE ( )) is the standard deviation of the collection of RMSE ( ), then for each k-fold we define the optimised value of , min , as: In order to find our surviving coefficients, we run the minimization algorithm again on the complete data set, setting to min . With , the number of coefficients larger than the cutoff value , we define our library of surviving functions (X) = [ 1 (X), .., (X)], for which > with 1 ≤ ≤ . The penalty is useful for selecting which coefficients to discard and keep, but once this is done the presence of a penalty term biases all coefficients to smaller values. A penalty rewards smaller coefficients over larger ones, as the size of the penalty increases with the size of the coefficients. Having this in mind, our final model is found by re-running our minimization algorithm using only the functions in (X) and setting to zero, i.e. without penalty.

Example
In this section we introduce a simple example to more clearly illustrate our methodology. We build a matrix X as in Eq. 2, where each column ì has thirty points (N=30) and each point is a random number between zero and one. We will use three independent variables, ì 1 , ì 2 and ì 3 , i.e. M=3. We will also build a dependent variable ì as: where the noise comes from a Gaussian distribution centered on zero and with a width of 10% of the standard deviation of 1.3 + 2ì 1 . All of our variables in this example have the same order of magnitude, so there is no explicit need to standardize their units. Hence we use and ì notation within this section. Before running the model, we do not know the shape of Eq. 19. However, let us suppose that we suspect that ì should depend on the parameters ì 1 , ì 2 and ì 3 . As we are uncertain on how to model the dependence between the parameters, we include a large set of functions. In this example, our library of functions includes a constant, linear and quadratic terms only (leaving out cubic terms for simplicity). In total we end up with ten functions (D=10).
For the purpose of illustration, we focus on 1 and 3 , the parameters associated with the linear functions of ì 1 and ì 3 . Fig. 1 follows the trajectory of the minimizer for our example model and for these two coefficients. From Eq. 19 we know that 3 = 0 and 1 = 2. The minimizer starts in an arbitrary position (in the case of this example in 1 = 3 = 1) and follows the trajectory shown by the blue line. The dashed lines represent the contours of both the 2 (elliptic dotted contours) and ( ì ) (dashed contours) of Eq. 10. A gradient descent algorithm will try to move perpendicularly to these contours, but the modifications in our algorithm allow the path to quickly align to the valley around 3 = 0. Apparent deviations from this motion come from the fact that we are looking at the 2 dimensional projection of a ten dimensional trajectory. Fig. 2 shows the evolution of the RMSE with respect to for Starting point True Value Path Figure 1. Isocontours of the penalty function defined by Eq. 10 for the two different coefficients 1 and 3 associated with the functions 1 1 1 ( ì ) = 1 1 and 3 1 3 ( ì ) = 3 3 . The dashed hyperbolic and dotted elliptical lines are the isocontours of our penalty function and of the 2 statistic respectively. Given that the gradient is perpendicular to the contour lines, the minimisation routine can efficiently move toward the origin of the plot, and also to one of the axes. Hence the code will quickly reach the minimum if either or both coefficients are zero. our example model using the k-fold methodology of Section 2.4. For this example, we divide the data in to = 5 folds. The blue dashed line correspond to the training set and the green lines to the test set. The solid lines are the median of each set. We explore the hyperparameter between = 0 and = 800. This is different to our nominal range which would be between 0 and N otherwise. This is because the ratio / = 10/30 is much larger in this example than in our nominal set up using the full simulated data set, for which we have hundreds of functions to fit almost 10,000 galaxies (i.e. / ∼ 0.01). This significant change in this ratio of / requires a larger penalty to be considered to avoid any overfitting.
When is close to zero in Fig. 2, the RMSE in the training set (blue line) is small, the model is overfitted and therefore bad at predicting the result in data that it has never seen before. This results in the comparably larger error on the test set (green line). For the largest values of the penalty, the model becomes too simple and the error on both the test and the training set begins to increase.
Around ∼ 10 in Fig. 2, the fit of the test set improves and the RMSE reaches its minimum. This is where the model is the least susceptible to overfitting while still capturing the important features of the data set. The black dots indicate the minimum RMSE for the test set of each individual k-fold and the black dashed line shows the mean value of these points, (RMSE ). Our optimal value of is given by min , defined by Eq. 18 and shown as a vertical red line. Fig. 3 shows how the best-fit coefficients of our example model evolve for different values of . Each curve is the mean curve from our 5 different folds. As stated in Section 2.2, the code will not set parameters exactly to zero but to a very small value which is determined by the parameter of Eq. 10. Fig. 3 Figure 2. Evolution of the RMSE from the best fits of our example model as function of the hyperparameter . The blue and green dashed lines represent the RMSE of the training and test sets respectively. The solid lines represent the median of these curves. When is close to zero, the training set has a very small error and the test set a comparably larger one: this is due to overfitting of the minimizer and it improves as grows. The black dots indicate the minimum RMSE for the test set of each individual k-fold: this is where overfitting was smallest. The black dashed line shows the mean value of the s of the black dots. The red dots are plotted at the values of given by our modified one-standard-error rule. The red line indicates the mean of s of these red dots and is our estimate of min from Eq. 18 . example the value is ∼ 6 × 10 −4 (this is true for both the example model and the galaxy data set as it only depends on ). Therefore we select a cutoff value of = 1 × 10 −3 . This is is shown as the grey shaded region in Fig. 3. The coloured lines correspond to the coefficients that were above the cutoff value at min , and therefore included in the final model. At = 0, all coefficients are above the cutoff threshold due to over-fitting. As grows, coefficients drops below the threshold value and at min all coefficients other than 0 and 1 have been discarded. This is expected as 0 and 1 are the non-zero coefficients used in building ì according to Eq. 19. The grey dashed lines are the coefficients that were below at min and therefore discarded. The vertical dashed line corresponds to the optimized value min .
The final model selected by the algorithm is: Considering that this a fit to data generated using Eq. 19 with 10% Gaussian distributed noise, we can conclude that our algorithm generated a sparse and accurate representation of the data.

DATA SET
This section introduces the data set that is used in our analysis. In §3.1, we introduce the hydrodynamical simulation, which data The coloured lines show the value of the mean fit of our independent k-fold runs for our surviving coefficients, i.e. those coefficients that are larger than the cutoff value at the optimised value min of the hyperparameter . The dashed lines show the true coefficients used to create the data from Eq. 19. The grey dashed lines show the evolution of the values of the coefficients that were discarded in the final model. The grey shaded area represents our cutoff value , below which parameters will be taken out of the fit. The dotted black line represents min . We note that at min all coefficients are set to zero except 0 and 1 . is used to train our model on. In §3.2, we describe the selection of the galaxies considered, as well as the method used to address inconsistencies in the classification of galaxies between different snapshots of the simulation. §3.3 and §3.4 introduce the set of variables that form ì of Eq. 2: the former introduces all variables associated with the mass of the host halo and the latter with its angular momentum. Finally §3.5 lists the four models considered, which differs from one another by the set of variables used to define ì .

The EAGLE simulation
Hydrodynamical simulations provide powerful insight into the galaxy formation process. The resulting database catalogues DM halos and their connection to baryonic properties such as stellar mass. In this paper, we use the Evolution and Assembly of Galaxies and their Environments ( EAGLE, Schaye et al. 2015;Crain et al. 2015) simulations, a suite of hydrodynamical simulations built inside cubic periodic volumes. We use the largest of these volumes, corresponding to a box of 100 comoving Mpc of length.
The simulation runs using a modification of the GADGET 3 code described in . The code uses Smooth Particle Hydrodynamics methods to model the mechanics of the baryon fluid. In order to compute the gravitational potential, the code uses a combination of a Particle Mesh (at large scales) and a hierarchical Tree algorithm (at grid and subgrid scales). The details on the modifications can be found in Schaller et al. (2015d). The simulations are built using the Planck cosmology (Planck Collaboration et al. 2014).
Baryonic physical processes that cannot be solved directly are implemented into the simulation as sources and sink terms, where energy and matter are either absorbed or injected locally into the simulation. These subgrid models should depend only on the local property of the gas. The subgrid models implemented account for radiative cooling (Wiersma et al. 2009a), star formation (Schaye & Dalla Vecchia 2008), star formation feedback (Dalla Vecchia & Schaye 2012), black hole growth (Rosas-Guevara et al. 2015;, Active Galactic Nuclei feedback (Booth & Schaye 2009) and chemical enrichment (Wiersma et al. 2009b). The uncertain parameters of the subgrid models need to be calibrated, which is done by choosing the values that would reproduce the galaxy mass function at =0.1, the galaxy size-stellar mass relation and the black hole mass-stellar mass regression. Discussion of the calibration process can be found in Crain et al. (2015).
Haloes are defined using a Friends-of-Friends algorithm (FoF; e.g. Einasto et al. 1984) with a linking length of =0.2, i.e. all particles that can be linked together with an inter-particle distance smaller than 0.2 times the mean inter-particle distance form a halo. Once the halos have been identified, the SUBFIND algorithm (Springel et al. 2001) identifies the self-bound local overdensities of each FoF group as subhalos. The subhalo that contains the particle with the lowest value of the potential energy will be defined as the central sub-halo.
The simulation information is saved at 29 redshifts from =20 to =0 (i.e. 29 snapshots), and is used to build merger trees, which connect a halo to its progenitors at earlier redshifts (Qu et al. 2017). The main progenitor of a halo is defined as the progenitor with the largest mass at all earlier outputs. We use these main progenitors to track the mass evolution of a DM halo (Section 3.3). Note that when two halos pass close to each other without merging they could momentarily belong to the same FoF group. As a consequence, the mass and the subhalo chosen as the central may be inconsistent at this snapshot when compared to the one immediately before or after the interaction (Behroozi et al. 2015). We introduce a scheme to clean such issues from the input data in section 3.2.

Data selection
Our data set consists of central galaxies inside halos with a mass larger than 200, > 10 11.1 M . 200, corresponds to the mass inside the radius 200, of a halo, which is the radius within which the density is 200 times the critical density of the universe. The stellar mass of a galaxy is measured as the baryonic mass contained inside a sphere of 30 proper kpc around the centre of potential of the halo.
Baryonic processes inside halos can affect their measured DM properties (Bryan et al. 2013;Schaller et al. 2015c). If we run our code using the properties of the DM found in a hydrodynamical simulation, we risk including biases by fitting the stellar mass using a property that has been modified by the presence of baryons (this modification would be correlated with the stellar mass as the halos with more baryons would be more modified). To avoid this bias, it is common practice to extract all DM input properties from a DM only simulation generated with the same initial conditions, box size and resolution as the full hydrodynamic simulation.
The matching between the hydrodynamic and DM-only simulations is described in Schaller et al. (2015b). The 50 most bounded DM particles of each halo in the hydrodynamic simulation are found. If a halo in the DM-only simulation has at least half of those most bound particles it is considered its analogue. Using this method, 99% of the halos with 200, > 1 × 10 11.1 M are matched. We collect information about the host DM halo at different redshifts (Section 3.3)and require that our halos are present in all snapshots. With this in mind, we only use galaxies with a progenitor defined at =4 . Our full sample consists of 9521 galaxies.
Inconsistencies between snapshots are a well-known characteristic of the merger trees (Behroozi et al. 2015) created by running the halo finder separately on each snapshot. When two halos interact some of the particles of one can be assigned to the other regardless of where they belonged in past snapshots. One consequence is that small central halos can be considered satellites of a larger halo if they are close to each other at a given snapshot. In EAGLE, 200, is only computed for central halos, which means that they will not have a value of 200, at these snapshots.
When this happens we interpolate the value of 200, in the missing slices using the following methodology: we look for the 200, value of both the nearest earlier and later redshifts where the halo was still central. We use these values to do a linear interpolation of 200, in the missing slice. The nearest earlier redshift is always well defined (as at z=0 all of our selected subhalos are central); however, a small subset of galaxies have a non-central progenitor at their largest redshifts, and therefore their nearest latter subhalo is not necessarily well defined. In these cases we select the third to last and second to last halos and perform our interpolation with those. We follow a similar procedure to correct the angular momentum of halos that are not considered central in a given slice. We found that the value of the angular momentum can have drastic variations when compared to its value at the surrounding redshift slices, which is due to the number of particles assigned to the halo changing significantly when it is misclassified as a subhalo.
The black lines of Fig. 4 shows the halo mass history relative to the halo mass at redshift zero of four halos from = 4 and to = 0. The figure shows that different halos have very different formation histories. We will explore whether galaxies that have followed different halo formation paths will end up having different residuals in the SMHM relation.

DM Mass
Once we have selected the galaxies in our data set, we define the parameters of the DM halo that are used to build the matrix of Eq. 2. The first variable accounted for is the halo mass at redshift zero (or any variable highly correlated with it), as the SMHM relation explains most of the scatter in the stellar mass. We will denote the Halo Mass input variable of a galaxy as 0 and define it as We use Eq. 3 to standardize the units and denote the halo mass in standardized units as 0 .
There is significant scatter around the SMHM relation due to their varied formation history, therefore we should also add parameters that are good estimators of the mass evolution of the DM host halo. This can be done by adding the halo mass of the main progenitor of a host halo at different redshift slices into our matrix. The EAGLE simulation has 19 snapshots between z=0 and z=4 (McAlpine et al. 2016). However, information between redshift slices that are close to each other is strongly correlated as halos have not evolved significantly. Keeping this in mind and given that the computational cost of running the minimizer increases exponentially with the number of parameters, we only use a subset of  The ( / 0 ) parameters are given by the intersection of the red and black lines. The blue horizontal lines correspond to a constant mass ratio of 90%, 70%, 50%, 30% and 20% (from top to bottom). The formation criterion parameters FC can be visualized as the intersection between the blue and the black lines. the available redshifts. The ten redshifts slices that we use as inputs are in =[0.0, 0.18, 0.37, 0.62, 1.0, 1.26, 1.74, 2.48, 3.02, 3.98].
Sparse regression methods work best if variables are independent, therefore we will use the ratio between the mass at a given redshift and the mass at redshift zero (so that the significant correlation of the mass at a given redshift and its mass at = 0 is removed). We will denote these variables as ( / 0 ) and they are defined as: We then use Eq. 3 to standardise the units and form / 0 . An alternative approach to characterise halo evolution is the formation time (Lacey & Cole 1994), defined as the time at which a halo has assembled half of its present day mass. We generalize this idea to define five formation criteria (FC ) by finding the redshifts (instead of times) at which the DM halo has assembled 20%, 30%, 50%, 70% and 90% of its mass respectively. The set of all five formation criteria for our sample will be referred to as FC , where denotes the percentage used. Fig. 4 shows a set of horizontal blue lines corresponding to halo mass ratios (Eq. 22) of 90%, 70%, 50%, 30% and 20%. The redshifts at which each formation history curve (black solid line) intersects this blue horizontal lines is a visual representation of FC . The redshifts that correspond to a given formation criteria are found by performing a linear interpolation of the halo mass ratios. As with all parameters, a final step is to standardise the units using Eq. 3 and to form FC .

Specific angular momentum
There is a well known observational scaling relation between the angular momentum of a galaxy and its stellar mass (Fall & Romanowsky 2013). It is a matter of discussion, however, how much of a role the angular momentum history of a dark matter halo plays in determining the specific angular momentum of its host galaxy. Zavala et al. (2016) finds strong correlations between both parameters using the EAGLE simulation. However, Danovich et al. (2015) suggest that the specific angular momentum of gas and dark matter undergo different formation histories, which would suggest that any correlation between them is a by-product of a third correlation with other parameters like the mass formation history. Having this in mind, we will generate candidate models that also include specific angular momentum input parameters on top of the mass evolution parameters.
Angular momentum evolution is included in our methodology by computing the halo specific angular momentum vector, ì j, defined within a radius and for each redshift slice as: where ì and ì are the position and velocity vectors of each particle within a radius of the centre of mass, is the mass of the particle, and ì , ì are the position and velocity of the centre of mass of the halo. We will use different values of in order to capture the angular momentum evolution of the full halo and of its centre separately. The specific angular momentum defined in Eq. 23 correlates strongly with the mass of the halo: this is driven by the scaling relations |ì| ∝ 1/3 and |ì| ∼ √︃ ∝ 1/3 . To avoid strongly correlated variables in our parameter set, we define the following specific angular momentum parameter: where | ì j( , )| is the norm of ì j( , ).
Given that the angular momentum is a vector, we need two types of variables to describe it: one capturing its magnitude and the other one its direction. Therefore, we will also include the change in the angle, Θ', defined as the scalar product between the halo specific angular momentum at redshift w.r.t. the one at the present time, i.e.: Note that by definition (Θ( = 0)) = 1 for all galaxies and hence we only include (Θ( > 0)) in our list of variables. As with all other variables we use Eq. 3 to standardize the units and form the scalars ( , ) and Θ( ). We form the following library of ì j parameters for each halo at each redshift: ( 200 , ), ( 200 2 , ), ( 200 5 , ) and Θ ( ).
The evolution of our specific angular momentum parameters has significant statistical noise and so it is smoothed across different redshifts using a Gaussian Kernel.

Models
The four models considered in this work are: (i) Mass ratio: This model includes values of the halo mass at redshift zero, 0 (Eq. 21), and the halo mass ratios, / 0 (Eq. 22), that parameterise the DM halo mass evolution. With 10 different redshift slices, this gives a total of = 10 input parameters, resulting in a total of D=286 functions to minimize over (Eq. 4).
(ii) Formation criterion: In this model, the DM ratios are replaced by the formation criterion FC , defined in Section 3.3. This model uses, as parameters, 5 values of FC (with p=[90, 70, 50, 30, 20]) and the halo mass at redshift zero, 0 , resulting in = 6 and = 84 functions to minimize over.
(iii) Mass ratio and ì j: In this model we add the specific angular momentum parameters ì j (and more specifically ( 200 , ), ( 200 /2, ), ( 200 /5, ) and Θ( ) at each of the ten snapshot considered), to the library of free parameters of the mass ratio model. The library of functions contains the linear, quadratic and cubic terms of the Halo mass evolution parameters / 0 . Only the linear terms of the specific angular momentum parameters are included. To include all the quadratic and cubic terms would result in = 23426 functions to minimize over, which at the moment is too computationally expensive for our algorithm. Hence we will only include linear terms for the specific angular momentum parameters, ending up with a total of = 326 functions to minimize over.
(iv) Formation criterion and ì j: This model is similar in spirit to model (iii), but we add the terms of the specific angular momentum parameters, ì j, to the library of free parameters of the formation criterion model instead. As with model (iii), we consider only the linear terms of the specific angular momentum parameters, ending up with = 123 functions to minimize over.

RUNNING THE ALGORITHM
In this section we present some specific aspects of applying the methodology presented in Section 2 to the data described in Section 3. In particular tests of the consistency of the algorithm are considered: we evaluate the impact of the chosen parameter in Section 4.2 and discuss the uncertainty of the parameter models in Section 4.3. The model results are presented and discussed in Section 5.

Training, holdout and test sets
The data is randomly divided into two, the training set and the holdout set. The training set contains 85% of the data and is used by the algorithm to build the model. The remaining 15% constitute the holdout set and is not used until the model is completed. The final model is applied to the holdout set to test its accuracy by considering data not used in the building of the model and therefore is unbiased to over-fitting.
Note that the holdout data set is different from the test sets used for estimating the optimal value of the hyperparameter in the k-fold methodology of Section 2.4. The latter constitutes sets drawn from the training set that are systematically kept out of the minimizations done while exploring the parameter space and are used to determine . They are part of the methodology for building our model. The holdout set, on the other hand, is kept out of this methodology completely and is used to evaluate the final model once it is built.

Penalty Hyperparameter
This section applies the methodology used for optimizing the hyperparameter , as introduced in Section 2.4. It discusses the impact of the assumed value for the parameter , used in the penalty function (Eq. 10).
From Section 2.4, the optimal value of , , is determined using a k-fold method with = 10 folds. Each fold runs independently and in parallel on different computer nodes. Fig. 5 shows the evolution of the RMSE of the mass ratio model ( §3.5) as function of the hyperparameter . The green and blue dashed lines correspond to the test and training sets respectively. Each test set (green dashed lines) has around 800 points, which is around 10% of our training data set, and the minimization runs with D=286 free parameters. The green dashed lines show some spread in their amplitudes, which are correlated with their value at = 0. This spread is a consequence of dividing the subsets randomly. Some subsets will contain a larger amount of points that are well predicted by the model and will, therefore, have smaller errors.
As we saw with Fig. 2, the RMSE of the training set is smaller when ∼ 0 as overfitting makes the model agree unreasonably well with the data it uses for the fitting. In contrast when the model is tested on data it has not seen before, the RMSE is larger, as shown by the comparatively larger error on the test set. As increases, the error on each test set decreases and eventually, reaches a minimum (RMSE ) around ∼ 100, as shown by the black dots in Fig. 5. This is where the model is least susceptible to overfitting, while still capturing the important features of the data set.
The red dots in Fig. 5 show the correction obtained with the one-standard-error rule from Eq. 18. The plot shows that these points are to the right of the minimum value of the green dashed lines, however the differences in RMSE between the actual minima and the red dots are small. This means that the resulting models are simpler (and therefore more explainable) and with comparable accuracy. The red solid line is the optimized value of the hyperparameter, min , as estimated using Eq. 18. Fig. 6 shows the evolution of the coefficients of Eq. 5 of the mass ratio model ( §3.5 as a function of the hyperparameter . The vertical black dotted line shows the value of found by our algorithm. Each coloured line correspond to a coefficient that is above the cutoff value at min , with represented as the boundary between the white and grey regions of the plot. The grey dashed lines correspond to the coefficients that are below at min and therefore discarded. The figure shows that coefficients that have been discarded have a value of around 0.0005 or lower (shown by the average value of the black dashed lines at large values of ), given that our cutoff value is 0.001 there is a distinct separation between the chosen coefficients and those discarded. Different coefficients are fitted by the minimizer with different orders of magnitude 1 . Therefore we need to make sure that the value of the parameter of the penalty term (Eq. 10), which determines how close to zero unnecessary parameters get in the minimization, is such that discarded coefficients are well below the cutoff value , and are close enough to zero that they can be separated from useful coefficients.
Nominally we use a value of = 10 −3 , which, as shown in 1 As our input variables are not Gaussian, several parameter values are above one standard deviation. In a standardized space this will mean that they will be larger than one. As a consequence linear, quadratic and cubic coefficients will require different scales to make similar contributions. . This parameter determines how close to zero coefficients get before their contribution to the penalty is negligible. The cutoff value is set as = for each run. The black dotted line shows the value of used in our standard configuration. When is large, all coefficients are above the cutoff value . For = 5 × 10 −4 , all kept coefficients are significantly larger than 10 −3 , indicating the adequacy of our nominal choice for . Fig. 6, corresponds to the minimizer setting unused parameters to a value as small as ≈ 6 × 10 −4 . This is comparable to the findings of the example presented in Section 2.5. A very small value of increases the computational time significantly given that parameters need to be driven further toward zero. Our choice represents a value of that is small enough to get parameters close enough to zero while not being so small that the code becomes too expensive to run.
To test what impact the value chosen for has, we consider the formation criterion model ( §3.5). This model has less free parameters than the mass ratio model ( = 84 versus = 286) and hence requires significantly less computational time, enabling an adequate parameter space to be explored. Fig. 7 shows the resulting coefficients after running our full algorithm using five different values of using the formation criterion model. The coloured lines show the parameters that are above the cutoff value in the model built with = 10 −3 (our standard value) and represent the variables that where chosen by the algorithm. The grey dashed lines correspond to the values rejected at = 10 −3 . The cutoff value depends on how close parameters get to zero and therefore it is a function of . For the propose of illustration, we set = .
For larger values of , there is no clear cut between discarded coefficients and most of the cubic and quadratic terms end up in our model. On the opposite end at = 5×10 −4 , all accepted coefficients are significantly greater than the cutoff value.
While the difference between useful and useless coefficients is clearer at = 5 × 10 −4 , our standard configuration with = 10 −3 seems to work just as well while being significantly faster to run.

Uncertainty on the models
Several of our coefficients are associated with functions of the same form but with inputs from different redshifts (see Section 3.3). If the halo mass does not vary significantly between adjacent redshift slices, then the corresponding polynomial functions ( ) are likely to show some correlation between them.
In general different order combinations of correlated terms will also be correlated. Considering the above statements, it is possible that the parameter space of coefficients has several local minima. This could be an issue for gradient decent algorithms, as by construction they will converge only toward the closest minimum. In practice we are satisfied with any reasonable minimum: for example, we do not have a preference between a feature being explained by the halo mass ratio at one specific redshift versus that of an adjacent redshift slice.
This, however, means that there might be slight variations in the surviving parameters of different models depending on the starting point of the minimization and depending on the specific selection of the training set. We test for both aspects in turn.
To test how strong an effect the initial starting point is, we perform five different minimizations of the formation criterion model using 5 distinct starting points in the minimsation algorithm. We set min = 932, which is the optimised value found by running our methodology with our standard configuration. The initial point in the parameter space is varied to random values between the five runs and is the only feature that is different between runs. Fig. 8 shows the best fit coefficients obtained using 5 different sets of initial positions. All models have an equivalent accuracy with a RMSE within the range 0.249 ± 0.001. Three out of the five models use 19 parameters and the remaining two use 18. All resulting models have equivalent accuracy and simplicity and we can not select one as being significantly better than the rest.
We can tell that the most significant coefficients (i.e. those with a larger ) are kept constant amongst all runs, similarly there is a large subset of coefficients that are not necessary in any of the models. However, there is a subset of parameters that are interchangeable between different models. An example of this is shown by the green and blue circles, which correspond to the coefficients associated with 0 ×FC 30 and 0 ×FC 20 functions in runs 3 and 4 respectively. Both runs are very similar in almost every parameter, except that run 3 gives a very important role to the 0 × FC 30 function and almost discards the 0 × FC 20 function, while run 4 does the opposite. This indicates that both parameters are correlated with each other and that our methodology can choose one or the other and still come up with equivalent solutions.
To test the variance of our methodology, we make six independent runs of the formation criterion model, varying only the holdout set, the data that is kept outside of the model fitting process. One of the holdout sets is our standard holdout, used throughout the paper. The other five correspond to five independent subsets of the training set with the same amount of points that the standard holdout set: the six independent holdouts considered have each 15% of the whole data set. The RMSE of the 6 resulting models are [0.167, 0.170, 0.169, 0.162, 0.162, 0.167] and they have [16,14,15,15,18,17] surviving coefficients each respectively. Therefore all six models have similar accuracy and comparable simplicity. Fig. 9 shows the variations in the resulting coefficients that survived in at least one of the six models. Solid line are used for the eleven coefficients that survived in all of six runs. This means that on average two thirds of all coefficients are the same irrespective of the specific holdout data set used. We note that the  numerical values of those eleven coefficients are often of similar amplitude in all runs. Of the remaining coefficients, two are present in five of six models and a further two in four of six. Hence there are 15 coefficients present in nearly all six models, indicating how robust our algorithm is to changes in the holdout set used. We note that some of the other coefficients found in some runs are likely correlated with those ones and are sometimes present but discarded in at least half of the runs.

RESULTS
We now present the results of our four models defined in Section 3.5, i.e.: (i) Mass ratio, (ii) Formation criterion, (iii) Mass ratio and ì j and (iv) Formation criterion and ì j. The specific surviving coefficients selected by each of the models are presented in Table 1, where coefficients are reported in standardized space. They can not be used directly to model the actual data, which needs to be transformed using Eq. 3. The standardised space is defined by the mean and the standard deviation of the logarithm of the stellar mass of galaxies and of the dependent variables ì , which are shown in 2.
A striking feature of models (iii) and (iv), the two models with ì j, is that the algorithm does not select any specific angular momentum parameters in either of them. In fact the selected parameters of model (ii) and (iv) are almost identical. While there are some small differences between the coefficients chosen in models (i) and (iii), namely model (iii) selects two extra parameters, and the values of some of the common parameters are slightly different. These difference are consistence with the variance of the methodology reported  for the formation criterion model using six different holdouts, with the right most one corresponding to the standard holdout set used throughout this paper. The lines connect coefficients that survived in at least one model, with the right hand key indicating which function they refer to. The line style indicates how often a given coefficient was kept by the best fit model (as indicated by the key). The colour coding of the lines is only there to help to differentiate between them. Each run uses %15 of the data as holdout set, each of which are disjoint from each other. The resulting models, which have similar accuracy (RMSE=0.166 ± 0.004), select a somewhat different subsets of surviving coefficients , with the most important ones remaining the same and the less important ones often exchanged for comparable ones. See text for further discussion.
in Section 4.3. This indicates that the contribution to the accuracy of the model after including the angular momentum parameters is negligible: the sparse regression method found that no angular momentum parameters contributed additional information necessary to describe the SMHM relation that was not already provided using the rest of the parameters. This suggests that any correlation between the specific angular momentum history of a galaxy and that of its host halo should be the consequence of a correlation between the mass and specific angular momentum formation histories of host halos. Fig. 10 shows the predicted values of the stellar mass for all galaxies in the holdout set for three models (omitting model (iv) as it is so similar to model (ii)) compared to their real values in the EAGLE simulation. The closer a point is to the one-to-one line (black dashed line), the better the model predicted its value. We also include the RMSE of each model, as given by Eq. 17.
A different estimate of the goodness of a fit is the 2 statistic, Models: (i) mass ratio (ii) formation criterion (iii) mass ratio and ì j (iv) formation criterion and ì j which determines the amount of the variation in ì that can be explained by a model 2 : where is the standard deviation of ì. The usefulness of the 2 comes from being intuitive to interpret: the closer to one the 2 of a model is, the more accurate it is.
Both the RMSE and 2 statistics show that the three models have very similar accuracy. The formation criterion model is slightly simpler than both mass ratio models, as the former has 17 free parameters, compared with 20 and 22 from the two mass ratio models. This suggests that the formation criteria parameters, FC , are slightly more efficient at summarising the halo mass information than the mass ratio parameters, ( / 0 ).

Comparison with simpler models
While the LASSO approach uses only a fraction of the full set of available regression terms, the models it selects are still relatively complex and include non-linear combinations of terms characterising the formation history. In this section, we compare our results to simpler models. Specifically, we compare the formation criterion model from the last section with the following two models: 2 2 estimators should be considered with caution as they are easily biased by inaccurate estimations of and can have deceivingly small (or large) values. They should be used as reference only. We also include RMSE errors as goodness of fit estimators, which are far more robust.
• The first model is a third-order polynomial fit of the SMHM relation. This model includes the terms 1, 0 , 2 0 , and 3 0 . We label this model as 3 0 , with all four coefficients selected by our LASSO method 3 .
• Our second model is similar to the one presented in Equation (9) of Matthee et al. (2016). We include all terms of 0 up to the third order and all linear terms of 50 . More specifically, the eight possible terms are 1, 0 , 2 0 , 3 0 , FC 50 , 0 × FC 50 , 2 0 × FC 50 , and 3 0 × FC 50 . We did not use the model presented in Matthee et al. (2016) directly because of small differences in the calibration redshift and in the methodology used for selecting and processing the EAGLE data sets. We label this model as ( 3 0 & 50 ), with six coefficients selected by our LASSO method 4 . We have tested the prediction of this model against the predictions of Matthee et al. (2016) 5 and find that the models are comparable.
As the models grow in complexity, their prediction of the stellar (iii) Mass ratios and j R 2 = 0.94 RMSE=0.166 Figure 10. Comparison between the stellar mass predicted by the models and its actual value in the EAGLE simulation for all galaxies in the holdout set. Left, centre and right panels correspond to the mass ratio, the formation criterion, and the mass ratio and ì j models respectively (as indicated in the header of each panel). The closer each point is to the one-to-one relation (black dashed lines), the more accurate the model prediction is. The value of the RMSE and the 2 statistic are included for each model. In general the three models have equivalent accuracy. As the formation criterion and ì j model is virtually identical to the formation criterion model (see Table 1 for parameter values), we included only the latter one.   Table 2. Normalization parameters used for the stellar mass and the DM halo variables defined in section 3.3 and considered by our models. The and rows correspond to the mean and standard deviation of the variables respectively and are used in Eq. 3 to standardize the units of the variables considered. mass becomes more accurate, a way of quantifying this is by looking at the RMSE of our data set. The 3 0 model has a RMSE of 0.225 when estimated with stellar mass units 6 . We obtain very similar results looking both at the holdout set and the whole dataset. For the 3 0 & FC 50 model, the stellar mass RMSE drops to 0.181, while it is 0.166 for the formation criterion model. Assuming that contributions from the different terms can be added in quadrature, this shows that 32% of the variance of the 3 0 model is explained by including linear terms in FC 50 , while the more complex model selected by the LASSO process explains a further 10% of the variance, a modest but significant improvement.
This suggests that the biggest improvement on the SMHM residuals (modeled by 0 3 ) comes from the linear terms of FC 50 , the higher-order terms of FC 50 and the terms corresponding to other formation criteria make a smaller but significant correction to the predicted stellar mass.
To explore the improvement of the model further, we define for each galaxy the error of a model as the difference between the 6 In this sub-section, the RMSE is expressed in natural units, i.e. the logarithm of the stellar mass. This results in RMSE values which are more natural to understand, as here an RMSE of 0.2 implies that the mean error is 0.2 log 10 ( * / ). We note that the RMSE depend on the parametrisation, which throughout the rest of this work is the one defined using standardised units (see Eq. 17). actual stellar mass and the predicted one, or more precisely: = log 10 * * ( ì , X ) where * corresponds to the model predicted stellar mass of a galaxy of stellar mass * . Fig. 11 shows the 68 th and 95 th percentile ranges of | | as a function of halo mass for the reference formation criterion model (blue lines), and the 3 0 & FC 50 and 3 0 models (purple and red lines respectively). The plot shows that the differences between the three models are most significant at small halo masses, while at halo masses larger than ∼ 10 12.5 , all models are comparable. This suggests that galaxies in smaller halos are more readily explained by evolutionary effects (correlated with FC parameters), while the scatter in larger galaxies is perhaps more strongly influenced by stochastic baryonic processes, such as black hole accretion, that cannot be modelled using the halo mass history alone. This is in agreement with Matthee et al. (2016) that found no correlation between the scatter of the SMHM relation and formation time for halo masses larger than ∼ 10 12.5 . Rather than restricting, by hand, the choice of functions to terms that are linear in FC 50 , we can of course ask the LASSO methodology to simplify the formation criterion model, trading off an increase in variance for a reduction in complexity. It should be remembered, however, that this model will not provide optimal predictions for the stellar mass in a RMSE sense. We shift the balance to reduce complexity by increasing the penalty parameter of Eq. 6. As can be seen in Fig. 5, using a penalty three times larger than the one selected by the LASSO algorithm generates a model that is comparable to model 3 0 & FC 50 in terms of the RMSE and number of surviving terms. The terms retained by the model are: 1, 0 , FC 50 , FC 70 , FC 90 , FC 3 20 , FC 3 30 , FC 3 50 , with coefficients 0.0538, 1.13, 0.0315, 0.0534, 0.0242, 0.00590, 0.0104, 0.0108 respectively. Interesting this model prefers to characterise the formation histories of the haloes more precisely rather than to mix terms depending on halo mass and formation time.

Interpretation
The goal of this work is to make a model that is accurate and also explainable. With this in mind, we now try to give a physical interpretation to some of the terms kept in our model.
By looking at Table 1, we conclude that in general surviving parameters in all models can be divided into four different groups: 1. Terms forming a third order polynomial of 0 . Namely the terms 1, 0 , 2 0 , 3 0 . 2. Terms forming third order polynomials of the other dependent variables that are correlated with the mass at > 0. Namely, terms of the form / 0 , ( / 0 ) 2 and ( / 0 ) 3 for the mass ratio models with and without ì j, and terms of the form FC , FC 2 and FC 3 for the formation criterion models with and without ì j.
3. Terms corresponding to the product of 0 and either / 0 for the mass ratio models (i) and (iii) or FC for the formation criterion models (ii) and (iv).
4. Other terms corresponding to higher order combinations of crossed terms, which are more challenging to provide a physical interpretation of.
The terms in group (1) correspond to a direct modelling of the SMHM relation. Let us call 3 ( = 0) the polynomial built with the terms in group (1) and their associated coefficients, 1 : In order to compare our model stellar mass predictions with the EA-GLE stellar masses, we transform our model from the standardised units to stellar mass units: where the stellar mass, * , is expressed in , and are the mean and standard deviation operators considered in Eq. 3 already.
3 ( = 0) computed for the formation criterion model is shown as the black dashed curve of the left panel of Fig. 12. The figure shows that 3 ( = 0) provides already a good model of the SMHM relation; however, there is some scatter around it that the model does not account for. We define the residual between each galaxy and the model prediction given by 3 ( = 0) as : Galaxies in Fig. 12 are divided into four bins. The yellow bin, which is the bin with the largest values, correspond to galaxies for which their stellar masses are the most under-predicted by 3 ( = 0), while the blue bin contains those with the most over predicted stellar masses. The right panel of in Fig. 12 shows the average mass of halos in each of the four bins as a function of redshift. On average galaxies in the yellow bin live inside host halos that attained their final mass early in their evolution when the characteristic density was higher. The deeper potential well of these halos allows the creation of massive galaxies. In contrast, the galaxies in the most over-predicted bin (blue) live inside host halos that only achieved their final mass very recently and therefore had a lower characteristic density for a considerable period of time, compared to halos of the same mass in larger bins. This implies that there is a correlation between and the mass formation history and explains why coefficients in group (2) were selected by our model. This conclusion is in agreement with Zentner et al. (2014), where formation time is used to model assembly bias, and with Matthee et al. (2016) where formation time is found to be the most correlated parameter with . We emphasize that we arrive at this conclusion by using a completely different approach, that does not require any prior knowledge of the underlying physics correlating stellar mass with halo mass and formation time.
A novel result from our model is that all terms of FC with = [20,30,50,70,90] are needed in the final fit. This suggests that formation time alone is not enough for our model to remove the correlation with , but actually tracking the different formation times at which different percentages of the final halo mass were assembled leads to more accurate models.
Our model suggests that the assembly history dependence is itself a function of the final halo mass. In order to explore this, we write the polynomial fits to each of the FC terms from (2), and their associated coefficients, : 20,30,50,70,90]. We define the residual as the leftover residual once we have removed contributions from all terms from groups (1) and (2), i.e.: where = [20,30,50,70,90]. We note that is defined in standardised space, with positive corresponding to a model underprediction and negative a model overprediction. The solid lines represent the mean of the logarithm of the halo mass ratios for all galaxies in each bin, with the same colour scheme as in the left panel. The shaded contours indicate the corresponding standard deviation on the mean. Galaxies with the more negative residuals reside in halos that recently assembled their final halo mass, while galaxies with the more positive residuals reside in halos that primarily assembled their halo mass at an earlier stage of their evolution. Fig. 13 shows where galaxies are in the FC 50 vs plane, where FC 50 (in standardized units) corresponds to the redshift when 50% of the mass of a halo has been formed.
The blue and red solid lines show the average for very massive and very small halos respectively. When FC 50 is negative (i.e. smaller redshifts than the average, i.e. at later times), galaxies living in massive host halos tend to be overpredicted by the model (as shown by the blue line being above zero) and galaxies living in small halos tend to be underpredicted (as shown by the red line being below zero). This shows why terms of the form FC × 0 , corresponding to coefficients in group (3) improve our model. The fact that the model selected terms of the form FC × 0 suggests that it is not enough to model a linear relationship between stellar mass and formation time (or in our case formation criteria), but that this relation needs to be corrected by a factor that is dependent on the final halo mass. Assembly bias suggests that the stellar mass of galaxies depends on formation history, our model also suggests that this dependency is in turn dependent on the final halo mass.

Stellar mass distribution and galaxy clustering of centrals
We have shown the models capability to reproduce the stellar mass of individual galaxies from the EAGLE simulation. We now discuss our models accuracy at reproducing other statistics from EAGLE such as the distribution of galaxy masses through the stellar mass function (SMF), and the clustering of the galaxies via 2-point correlation functions.
We consider the six realisations of the formation criterion model presented in Fig. 9 as a way of providing some uncertainty on the best fit model predictions. Throughout this section any model comparison with EAGLE relates to comparisons with central galaxies in EAGLE, as our model only make predictions for such galaxies. Furlong et al. (2015) shows that the SMF of the EAGLE hydrodynamical simulation at redshift zero agrees reasonably well with the one observed from SDSS (Li & White 2009) and GAMA (Baldry et al. 2012). The red dashed line of Fig. 14 shows the central galaxy stellar mass function obtained from the stellar masses in our EAGLE data set.
The red shaded region is an estimate of the error due to Poisson noise within the EAGLE sample and is computed with the bootstrap method (Efron 1979).
The blue lines in Fig. 14 are the SMFs computed using the stellar masses predicted by each of our models. The predictions are so similar that it is difficult to differentiate between them, especially in the top panel. The bottom panel of Fig. 14 shows that the model SMFs are within 12% of the input EAGLE SMF over most of the mass range. At stellar masses above log 10 ( * / ) = 11.0 the agreement of the models SMF worsens. This is likely due to the relatively small number of galaxies at this mass range in our sample (90 out of 9521). One of the many issues of including a comparatively small sample of galaxies is that the methodology has little incentive to fit them accurately as their contribution to the goodness of fit estimations is small. One possible way to improve this is to weigh their contribution more heavily than the one from galaxies in a more numerous mass range, this possibility will be explored in future iterations of this work.
The scatter in the mass function between different models is smaller than the bootstrap error (shown as the shaded area), which suggests that the difference between the SMF of EAGLE and that of our model is not due to random sampling effects. There are notable deviations at log 10 ( * / ) = 9.0 and log 10 ( * / ) = 10.5. . Relation between FC 50 and the residual of Eq. 32 for all galaxies in our holdout set. FC 50 , which is in standardised units, maps to the redshift at which a halo acquired half of its mass. The galaxies are colour coded by 0 (Eq. 21). The blue solid line shows the mean residual, ( ), for galaxies in very massive halos, i.e. halos where 0 > 2. Those halos are more that 2 standard deviations more massive than the mean. The red solid lines shows the mean residual, ( ), for galaxies living in halos with very low mass ( 0 < 0.8). The blue and red lines have slopes of opposite sign, which is reflected in the presence of terms from group (3) in the solution (see §5.2. The plot shows that the strength of assembly bias is correlated with the final halo mass. The disagreement at log 10 ( * / ) = 9.0 is likely to be caused by selection effects, as we include a cut in halo mass which can have an effect in our model predictions at those lower stellar masses. At log 10 ( * / ) = 10.5 the remaining residuals of the model are systematically larger and asymmetric, with the offset possibly correlated with other terms not included in our methodology. These parameters could be either other halo mass properties that we have not characterised, higher-order correlations of our input parameters, or the stochastic nature of baryonic processes. For example, feedback from super massive black holes has a highly non-linear effect on the stellar mass, either by affecting it directly, or through its influence on the baryon density inside halos (Bower et al. 2017;Martizzi et al. 2012). Whatever the cause, characterising these asymmetric residuals remains a challenging but important problem.
As a result of the asymmetric scatter, we find that the SMFs predicted by the simpler models, 3 0 and 3 0 & FC 50 from section 5.1, have only minor deviations from those predicted by the full model (the formation criterion model shown in Fig. 14). Although the more complex models predict more accurately the median stellar mass, all the models assume that the residuals are symmetric around this value: i.e., while the errors of a more complex model are smaller, they are not more symmetrical around their mean value. An improved treatment will have to characterise the spread of points as well as predicting a median of the relation.
The EAGLE hydrodynamical simulation has been shown to accurately reproduce the observed two point correlation function of galaxies from 1ℎ −1 Mpc and up to 6ℎ   In order to test how well our model reproduces the correlation function of EAGLE galaxies, we divide the galaxies in each of our models into four stellar mass bins. We then compute the two point correlation function of galaxies in each mass bin. This is done by assigning to each model galaxy the same co-moving coordinates as that of the centre of its host halo. Fig. 15 shows how the correlation functions of our models split by predicted model stellar mass compares with those obtained from the EAGLE simulation, split by the actual galaxy stellar mass. Each colour corresponds to a different mass bin, with each of our six models and for each stellar mass bin shown as solid faint lines. As with Fig. 14, the shaded areas show the bootstrap error estimate on the actual EAGLE clustering. The bootstrap method is done on a galaxy basis, which is still adequate in this case as we are not trying to quantify the impact of sample (or cosmic) variance: the models use the same set of DM halos as the EAGLE data, with only the stellar mass of their host galaxies possibly differing. The correlations functions from each of our six models and for each stellar mass bin are shown as solid faint lines. Fig. 15 shows that our correlation functions agree within errors with the ones from EAGLE, which suggests that our models assign galaxy masses in a way that is sufficiently accurate to reproduce the stellar mass clustering of central galaxies up to 10ℎ −1 Mpc. It is also noticeable that the scatter on the correlation functions from our methodology is smaller that the one from bootstrap errors. Hence to be able to differentiate between the models a significantly larger simulation volume would be needed.
Hydrodynamical of EAGLE galaxies split into four stellar mass bins (coloured dashed lines as per key) compared to the clustering computed with our 6 models (i.e. 6 thin solid lines for each stellar mass bin). Bootstrap errors are shown on the EAGLE correlation functions. Bottom panel: the ratio of the predicted to the actual galaxy clustering for each stellar mass bin (same colour coding as in the upper panel). This indicates that our models result in galaxy clustering estimates split by stellar mass that agree well within the bootstrap errors with the actual clustering of EAGLE galaxies. matter and the baryonic component of the universe are computationally expensive. This limits the volumes in which they can be computed to a few (100Mpc) 3 . Our models are informed by the physical processes relating the stellar mass of a galaxy and its host DM halo. Therefore, by populating DM-only simulations in larger volumes, our models could provide new tests of the hydrodynamical physics on larger scales than the ones permitted by direct comparisons with hydrodynamical simulations. The fact that we can reproduce accurately with our models both the stellar mass and the correlation functions of EAGLE, suggests that this approach is promising for populating DM only simulations.

DISCUSSION AND CONCLUSIONS
There is a well-known correlation between the stellar mass of a galaxy and the dark matter of its host halo (SMHM relation). However, this relation has significant scatter, which suggests that other properties are significant at determining the stellar mass of a halo. The halo mass evolution history and the specific angular momentum have both been proposed to be correlated with this residuals.
We use a sparse regression methodology to model the governing equations relating the stellar mass of central galaxies to the properties of their host dark matter halos. This method builds accurate and explainable models without needing much physical knowledge of the processes that determine the stellar mass of a galaxy from the halo properties of its host. In sparse regression methods, the lack of physical knowledge is substituted by large numbers of free parameters, where each parameter models different behaviours of the dark matter halo properties. A LASSO algorithm is used to optimize solutions. This method heavily penalizes the number of surviving parameters so that as few as possible are selected without losing accuracy. Here we have modified the form of the LASSO algorithm to be more efficient when combined with a gradient descent minimizer. This is achieved by including a regularisation term that smooths out discontinuities in the gradient that are present in standard LASSO when parameters are close to zero. This smoothing is characterized by a parameter that limits how close to zero coefficients need to get before being discarded by the algorithm. We also modify the method by which the minimizer decides which path to follow in such a way that we find performance gains in large dimensional spaces.
The size of the penalty is determined by the parameter , which is optimized using a k-fold methodology with = 10. We use the one-standard-error rule to select a value of that is larger than the best-fit and therefore builds a slightly less accurate model with fewer free parameters and therefore with more explainability.
The data that we use to build our models with comes from the EAGLE simulation. However, we emphasize that this method should be able to be calibrated against any simulation with similar results. We use a sample of 9521 central galaxies from the 100 cMpc box EAGLE suite of hydrodynamical simulations. The dark matter properties are read from a DM only simulation with the same initial conditions as our hydrodynamical simulation. The simulations are matched with each other in such a way that a pair is found for 99% of the DM halos.
We build four different models that differ by the independent parameters chosen to model the galaxy stellar mass. In the first instance, we consider two distinct model setups: (i) the mass ratio model uses the ratio between the mass of a halo at a redshift and that at = 0 to parametrise the mass history of the host halo; (ii) the formation criterion model uses the redshift at which a halo formed a specific percentage of its mass. For both models we include all linear, quadratic and cubic correlations of our independent variables as free parameters of the fits. Then we consider two additional models by extending the two previous models to include parameters related to the specific angular momentum ( ì j) history of the halos. More specifically, we consider parameters that characterize both the magnitude and the direction of the specific angular momentum vector, and vary the radius of the DM halo over which to measure the magnitude of ì j. Due to computational restrictions, we include only linear terms of the free parameters related to ì j.
The computational time of our minimization is correlated with the value of : a very large value would result in a very fast computational time, but it would be hard also to distinguish useful parameters from those that should be discarded. In Fig. 7 we show that a value of = 1 × 10 −3 selects the same coefficients as slower and more accurate runs without being too computationally expensive. Some input parameters are correlated with each other, for example, the mass ratio ( / 0 ) at a given redshift and that at a neighbouring redshift slice. In principle, our answers could be susceptible to the starting point of the minimizer; however, we show in Figs. 8 that neither the explainability nor the accuracy of the model changes significantly between runs with different starting points. We show in Fig. 9 that models trained on different subsets of the same data arrive at equivalent models.
Our algorithm did not select any angular momentum parameters for either model that included specific angular momentum parameters.
In fact, all the differences between these two models and their equivalent ones without angular momentum parameters are con-sistent with variations in our methodology. This suggests that any correlation between the linear terms of the angular momentum of a host halo and the residual of the SMHM relation is the consequence of correlations between the mass history of the halo and the history of its angular momentum. Given that model the formation criterion model is slightly simpler than the mass ratio model, we conclude that the formation criteria parameters, FC , are slightly more efficient at summarizing the halo mass evolution information than the mass ratios ( / 0 ). The formation criterion model is more accurate, although more complex, than models that include only halo mass terms, or models that also include a linear dependence on a single formation time. The improvement is, however, modest. Including a single linear formation time explains 32% of the residual variance, while the full models improves this by a further 10%. If greater simplicity is required, this can be achieved (at the expense of accuracy) by increasing the penalty hyperparameter, . The resulting model prefers to select terms that more closely characterise the formation history of the halo rather than terms the mix formation time and halo mass, however.
A subset of our surviving terms can be combined into a polynomial of 0 and is therefore a model of the SMHM relation. Other subsets of surviving terms can be combined into polynomials of either FC or / 0 (depending on the parametrization of the halo mass evolution history) and therefore model the assembly bias. Terms of the shape 0 × FC p (or 0 × / 0 ) add a significant correction to very small or very large halos. Our models suggest that a single formation time is not enough to model the variation in the SMHM relation, and that a better approach is to include the times at which different percentages of the mass have been formed. This is reflected in our model by the similar contribution of terms of the form FC for all in = [20,30,50,70,90]. Our model also suggests that the relation between the stellar mass and the formation times is not the same for all galaxies, but it depends on the halo mass at = 0.
We have shown how the stellar mass function (SMF) of our model compares to that of EAGLE central galaxies. They agree well within the bootstrap errors at most stellar mass values, except around log 10 ( * / ) = 9.0 and log 10 ( * / ) = 10.5. The difference at lower stellar mass could be explained by selection effects given that our model includes a cut on halo mass that could affect the prediction of the lower stellar masses. At log 10 ( * / ) = 10.5 on the other hand, the differences between the values predicted by our model and EAGLE are not symmetric around the mean. This suggests that the remaining residual of our model might be correlated with variables that have not been explored by our model. This could be either higher-order correlations of our current variables, DM variables that we have not considered yet, or the stochastic effects of the baryon physics affecting the stellar mass of the galaxy. These will be studied in further extensions of the model. We have also shown that the correlation function of EAGLE galaxies split by stellar mass is preserved in our models within the quoted bootstrap errors at all scales considered.
The fact that we can reproduce both the stellar mass and the correlation function of EAGLE accurately suggests that this method could be used to populate DM only simulations in larger volumes in a way that preserves these statistics. Our models are informed by the physical process that relates the stellar mass of a galaxy with the evolutionary and present properties of its host DM halo. Therefore DM only simulations that are populated using our methodology can provide tests of this physics on volumes where hydrodynamical simulations are prohibitively expensive to run. So far, however, our method has only been applied to central galaxies. Satellite galaxies in general have a weaker SMHM relation than halos. This is a consequence of satellites being subjected to processes like tidal stripping and heating. These processes modify the mass of subhalos and the galaxies they contain, meaning that the stellar mass of a satellite halo is different from what one would expect when comparing it with halos that were not stripped. By adapting our methodology to account for the more complex evolution of the satellite halo mass, for example by adding maximum progenitor masses to our list of variables, it should be possible to model the stellar mass of satellite galaxies as well. However, running our methodology with satellite galaxies would require to use a larger set of free parameters and a larger data set, as there are many more satellite galaxies than central galaxies. Therefore we should explore methods to optimize our minimization without losing reliability. One approach could be to use methods like principal component analysis to transform free parameters into a parameter space where they are uncorrelated. However, this might transform free parameters into inputs that are harder to interpret and might reduce the explainability of our results. These ideas will be explored in future iterations of this work.