LAGRANGIAN NEURAL NETWORKS

Miles Cranmer

Princeton University

mcranmer

@princeton.edu

Sam Greydanus

Oregon State University

greydanus.17

@gmail.com

Stephan Hoyer

Google Research

shoyer

@google.com

Peter Battaglia

DeepMind

peterbattaglia

@google.com

David Spergel

∗

Flatiron Institute

davidspergel

@flatironinstitute.org

Shirley Ho

†

Flatiron Institute

shirleyho

@flatironinstitute.org

ABSTRACT

Accurate models of the world are built upon notions of its underlying symmetries.

In physics, these symmetries correspond to conservation laws, such as for energy

and momentum. Yet even though neural network models see increasing use in

the physical sciences, they struggle to learn these symmetries. In this paper, we

propose Lagrangian Neural Networks (LNNs), which can parameterize arbitrary

Lagrangians using neural networks. In contrast to models that learn Hamiltonians,

LNNs do not require canonical coordinates, and thus perform well in situations

where canonical momenta are unknown or difﬁcult to compute. Unlike previous

approaches, our method does not restrict the functional form of learned energies

and will produce energy-conserving models for a variety of tasks. We test our

approach on a double pendulum and a relativistic particle, demonstrating energy

conservation where a baseline approach incurs dissipation and modeling relativity

without canonical coordinates where a Hamiltonian approach fails. Finally, we

show how this model can be applied to graphs and continuous systems using a

Lagrangian Graph Network, and demonstrate it on the 1D wave equation.

Figure 1: Physicists use Lagrangians to describe the dynamics of physical systems like the double

pendulum (black). Neural networks struggle to model these dynamics over long time periods due to

their inability to conserve energy (red). In this paper, we show how to learn arbitrary Lagrangians

with neural networks, inducing a strong physical prior on the learned dynamics (blue).

∗

Also afﬁliated with Princeton University

†

Also afﬁliated with New York University, Princeton University, and Carnegie Melon University

1

1 INTRODUCTION

Neural networks excel at ﬁnding patterns in noisy and high-dimensional data. They can perform well

on tasks such as image classiﬁcation (Krizhevsky et al., 2012), language translation (Sutskever et al.,

2014), and game playing (Silver et al., 2017). Part of their success is rooted in "deep priors" which

include hard-coded translation invariances (e.g., convolutional ﬁlters), clever architectural choices

(e.g., self-attention layers), and well-conditioned optimization landscapes (e.g., batch normalization).

Yet, in spite of their ability to compete with humans on challenging tasks, these models lack many

basic intuitions about the dynamics of the physical world. Whereas a human can quickly deduce that

a ball thrown upward will return to their hand at roughly the same velocity, a neural network may

never grasp this abstraction, even after seeing thousands of examples (Bakhtin et al., 2019).

The basic problem with neural network models is that they struggle to learn basic symmetries and

conservation laws. One solution to this problem is to design neural networks that can learn arbitrary

conservation laws. This was the core motivation behind Hamiltonian Neural Networks by Greydanus

et al. (2019) and Hamiltonian Generative Networks by Toth et al. (2019). These are both types of

differential equations modeled by neural networks, or “Neural ODEs,” which were studied extensively

in Chen et al. (2018).

Yet while these models were able to learn effective conservation laws, the Hamiltonian formalism

requires that the coordinates of the system be “canonical.” In order to be canonical, the input

coordinates

(q, p)

to the model should obey a strict set of rules given by the Poisson bracket relations:

p

i

≡

∂L

∂ ˙q

i

⇐⇒ {q

i

, q

j

} = 0 {p

i

, p

j

} = 0 {q

i

, p

j

} = δ

ij

, (1)

where {f, g} =

X

i

∂f

∂q

i

∂g

∂p

i

−

∂f

∂p

i

∂g

∂q

i

,

where

˙q

indicates a time derivative and

L

the Lagrangian. The problem is that many datasets

have dynamical coordinates that do not satisfy this constraint:

p

i

is not simply

˙q

i

×

mass in many

cases. A promising alternative is to learn the Lagrangian of the system instead. Like Hamiltonians,

Lagrangians enforce conservation of total energy, but unlike Hamiltonians, they can do so using

arbitrary coordinates.

In this paper, we show how to learn Lagrangians using neural networks. We demonstrate that they can

model complex physical systems which Hamiltonian Neural Networks (HNNs) fail to model while

outperforming a baseline neural network in energy conservation. We discuss our work in the context

of “Deep Lagrangian Networks,” (DeLaNs) described in Lutter et al. (2019), a closely related method

for learning Lagrangians. Unlike our work, DeLaNs are built for continuous control applications and

only models rigid body dynamics. Our model is more general in that it does not restrict the functional

form of the Lagrangian. We also show how our model can be applied to continuous systems and

graphs using a Lagrangian Graph Network in section 5.

Table 1: An overview of neural network based models for physical dynamics. Lagrangian Neural

Networks combine many desirable properties. DeLaNs explicitly restrict the learned kinetic energy,

and HNNs implicitly do as well by requiring a deﬁnition of the canonical momenta.

Neural net Neural ODE HNN DeLaN LNN (ours)

Can model dynamical systems X X X X X

Learns differential equations X X X X

Learns exact conservation laws X X X

Learns from arbitrary coords. X X X X

Learns arbitrary Lagrangians X

2 THEORY

Lagrangians.

The Lagrangian formalism models a classical physics system with coordinates

x

t

=

(q, ˙q)

that begin in one state

x

0

and end up in another state

x

1

. There are many paths that these

2

coordinates might take as they pass from

x

0

to

x

1

, and we associate each of these paths with a scalar

value called “the action.” Lagrangian mechanics tells us that if we let the action be the functional:

S =

Z

t

1

t

0

T (q

t

, ˙q

t

) − V (q

t

, ˙q

t

) dt, (2)

where

T

is kinetic energy and

V

is potential energy, then there is only one path that the physical

system will take, and that path is the stationary (e.g., minimum) value of

S

. In order to enforce the

requirement that

S

be stationary, i.e.,

δS = 0

, we deﬁne the Lagrangian to be

L ≡ T −V

, and derive

a constraint equation called the Euler-Lagrange equation which describes the path of the system:

d

dt

∂L

∂ ˙q

j

=

∂L

∂q

j

. (3)

Euler-Lagrange with a parametric Lagrangian. Physicists traditionally write down an analytical

expression for

L

and then symbolically expand the Euler-Lagrange equation into a system of dif-

ferential equations. However, since

L

is now a black box, we must resign ourselves to the fact that

there can be no analytical expansion of the Euler-Lagrange equation. Fortunately, we can still derive

a numerical expression for the dynamics of the system. We begin by rewriting the Euler-Lagrange

equation in vectorized form as

d

dt

∇

˙q

L = ∇

q

L (4)

where

(∇

˙q

)

i

≡

∂

∂ ˙q

i

. Then we can use the chain rule to expand the time derivative

d

dt

through the

gradient of the Lagrangian, giving us two new terms, one with a ¨q and another with a ˙q:

(∇

˙q

∇

>

˙q

L)¨q + (∇

q

∇

>

˙q

L) ˙q = ∇

q

L. (5)

Here, these products of

∇

operators are matrices, e.g.,

(∇

q

∇

>

˙q

L)

ij

=

∂

2

L

∂q

j

∂ ˙q

i

. Now we can use a

matrix inverse to solve for ¨q:

¨q = (∇

˙q

∇

>

˙q

L)

−1

[∇

q

L − (∇

q

∇

>

˙q

L) ˙q]. (6)

For a given set of coordinates

x

t

= (q

t

, ˙q

t

)

, we now have a method for calculating

¨q

t

from a black

box Lagrangian, which we can integrate to ﬁnd the dynamics of the system. In the same manner as

Greydanus et al. (2019), we can also write a loss function in terms of the discrepancy between

¨x

L

t

and ¨x

true

t

.

3 RELATED WORK

Physics priors.

A variety of previous works have sought to endow neural networks with physics-

motivated inductive biases. These include work for domain-speciﬁc problems in molecular dynamics

(Rupp et al., 2012), quantum mechanics (Schütt et al., 2017), or robotics (Lutter et al., 2019). Other

are more general, such as Interaction Networks (Battaglia et al., 2016), which models physical

interactions as message passing on a graph.

Learning invariant quantities.

Recent work by Greydanus et al. (2019), Toth et al. (2019), and

Chen et al. (2019) built on previous approaches of endowing neural networks with physical priors

by demonstrating how to learn invariant quantities by approximating a Hamiltonian with a neural

network. In this paper, we follow the same approach as Greydanus et al. (2019), but with the objective

of learning a Lagrangian rather than a Hamiltonian so not to restrict the learned kinetic energy.

DeLaN.

A closely related previous work is “Deep Lagrangian Networks”, or DeLaN (Lutter et al.,

2019), in which the authors show how to learn speciﬁc types of Lagrangian systems. They assume

that the kinetic energy is an inner product of the velocity:

T = ˙q

T

M ˙q

, where

M

is a

q

-dependent

positive deﬁnite matrix. This approach works well for rigid body dynamics, which includes many

systems encountered in robotics. However, many systems do not have this kinetic energy, including,

for example, a charged particle in a magnetic ﬁeld, and a fast-moving object in special relativity.

3

4 METHODS

Solving Euler-Lagrange with JAX.

Efﬁciently implementing equation 6 represents a formidable

technical challenge. In order to train this forward model, we need to compute the inverse Hessian

(∇

˙q

∇

>

˙q

L)

−1

(we use the pseudoinverse to avoid potential singular matrices) of a neural network and

then perform backpropagation. The matrix inverse scales as

O(d

3

)

with the number of coordinates

d

.

Perhaps surprisingly, though, we can implement this operation in a few lines of JAX (Bradbury et al.,

2018) (see Appendix). We publish our code on GitHub

1

.

Training details.

For both baseline, HNN, and LNN models, we used a four-layer neural network

model with 500 hidden units, a decaying learning rate starting at

10

−3

, and a batch size of 32. We

initialize our model using a novel initialization scheme described in appendix C, which was optimized

for LNNs.

Activation functions.

Since we take the Hessian of a LNN, the second-order derivative of the

activation function is important. For example, the ReLU nonlinearity is a poor choice because

its second-order derivative is zero. In order to ﬁnd a better activation function, we performed a

hyperparameter search over ReLU

2

(squared), ReLU

3

, tanh, sigmoid, and softplus activations on the

double pendulum problem. We found that softplus performed best and thus used it for all experiments.

5 EXPERIMENTS

Double pendulum

. The ﬁrst task we considered was a dataset of simulated double pendulum

trajectories. We set the masses and lengths to

1

and learned the instantaneous accelerations over

600,000 random initial conditions. We found that the LNN and the baseline converged to similar

ﬁnal losses of

7.3

and

7.4 × 10

−2

, respectively. The more signiﬁcant difference was in energy

conservation; the LNN almost exactly conserved the true energy over time, whereas the baseline

did not. Averaging over 40 random initial conditions with 100 time steps each, the mean energy

discrepancy between the true total energy and predicted was 8% and 0.4% of the max potential energy

for the baseline and LNN models respectively.

(a) Error in angle (b) Error in energy

Figure 2: Results on the double pendulum task. In (a), we see that the LNN and baseline model

perform similarly in modelling the dynamics of the pendulum over short time periods. However, if

we plot out the energy over a very long time period in (b) we can see that the LNN model conserves

the total energy of the system signiﬁcantly better than the baseline.

Relativistic particle in a uniform potential

. The second task was a relativistic particle in a uniform

potential. For a particle with a mass of

1

in a potential

g

and with

c = 1

, special relativity gives the

Lagrangian

L = ((1 − ˙q

2

)

−1/2

− 1) + gq

. The canonical momenta of this system are

˙q(1 − ˙q

2

)

−3/2

,

which means that a Hamiltonian Neural Network will fail if given simple observables like

˙q

and

q

.

The DeLaN model will also struggle since it assumes that

T

is second order in

˙q

. To verify these

1

https://github.com/MilesCranmer/lagrangian_nns

4

predictions, we trained HNN and LNN models on systems with random initial conditions and values

of

g

. Figure 3 shows that the HNN fails without canonical coordinates whereas the LNN can work

without this extra a priori knowledge, and learns the system as accurately as an HNN trained on

canonical coordinates.

(a) HNN, arbitrary coords. (b) HNN, canonical coords. (c) LNN, arbitrary coords.

Figure 3: Results on the relativistic particle task. In (a) an HNN model fails to learn dynamics from

non-canonical coordinates. In (b) the HNN succeeds when given canonical coordinates. Finally in (c)

the LNN learns accurate dynamics even from non-canonical coordinates.

Wave equation with Lagrangian Graph Networks.

Equation (6) is applicable to many different

types of systems, including those with disconnected coordinates such as graph-like and grid-like

systems. To model this, we now learn a Lagrangian density which is summed to form the full

Lagrangian. This is similar to the Hamiltonian Graph Network of Sanchez-Gonzalez et al. (2019), or

more speciﬁcally, the Flattened Hamiltonian Graph Network of Cranmer et al. (2020). For simplicity,

we focus on 1D grids with locally-dependent dynamics (i.e., nodes are connected to their two adjacent

nodes).

A continuous material can be described in terms of quantities

φ

i

, such as the displacement of a guitar

string, at each gridpoint

x

i

. One way to model this type of system is to treat the quantities on adjacent

gridpoints as independent coordinates, and sum the regular LNN equation over all connected groups

of coordinates. For n gridpoints, the total Lagrangian is then:

L =

X

i=1:n

L

i

, for L

i

= L

density

{φ

j

,

˙

φ

j

}

j∈I

i

(7)

where

I

i

= {i, . . .}

is the set of indices connected to

i

. For a 1D grid where only the adjacent

gridpoints affect the dynamics of the center gridpoint, this is

{i, i − 1, i + 1}

. Again, we can solve

dynamics by plugging the Lagrangian into eq. (6), now with φ as the coordinates:

¨

φ =

∇

˙

φ

∇

>

˙

φ

L

−1

∇

φ

L −

∇

φ

∇

>

˙

φ

L

˙

φ

, (8)

where, e.g.,

∇

φ

≡ {

∂

∂φ

1

,

∂

∂φ

2

, . . . ,

∂

∂φ

n

}

. Note that the Hessian matrix is sparse with non-zero

entries at “neighbor of neighbor” positions in the graph, which can make it much more efﬁcient to

calculate and invert. For example, in 1D it can be calculated with only 5 forward over backwards

auto-differentiation passes and can be inverted in linear time.

We can think of this as a regular LNN where, instead of calculating the Lagrangian directly from a

ﬁxed set of coordinates, we are now accumulating a Lagrangian density over groups of coordinates.

For a different connectivity, such as a graph network, or to approximate higher-order spatial derivatives

for a continuous material, one would select

I

i

based on the adjacency matrix for the graph. This

model is a type of Graph Neural Network (Scarselli et al., 2008). The Lagrangian density itself,

L

density

, we model as an MLP. Since we write our models in JAX, we can easily vectorize this forward

model over the grid.

We consider the 1D wave equation, with

φ

representing the wave displacement:

φ(x, t)

. For wave

speed

c = 1

, the equation describing its dynamics can be written as:

¨

φ =

∂

2

φ

∂x

2

. The Lagrangian for

5

this differential equation is:

L =

R

˙

φ

2

−

∂φ

∂x

2

dx.

For the LNN to learn this, the MLP will

need to learn to approximate a ﬁnite difference operator:

L

i

=

˙

φ

2

i

−

φ

i+1

−φ

i−1

2∆x

2

for grid spacing

∆x

(or any translation + scaling of this equation). We simulate the wave equation in a box with

periodic boundary conditions, and learn the dynamics with this Lagrangian Graph Network, shown

in ﬁg. 4. The Lagrangian Graph Network models the wave equation accurately and almost exactly

conserves energy integrated across the material.

Figure 4: Results on the 1D wave equation task, using a Lagrangian Graph Network. Here, the LNN

models the Lagrangian density over three adjacent gridpoints along the wave, and this density is

summed. Here, the waves make approximately three and a half passes across the 100 grid points

over the time plotted. A movie can be viewed by clicking on the plot or by going to

https:

//github.com/MilesCranmer/gifs/blob/master/wave_equation.gif.

6 CONCLUSION

We have introduced a new class of neural networks, Lagrangian Neural Networks, which can learn

arbitrary Lagrangians. In contrast to models that learn Hamiltonians, these models do not require

canonical coordinates and thus perform well in situations where canonical momenta are unknown

or difﬁcult to compute. To evaluate our model, we showed that it could effectively conserve total

energy on a complex physical system: the double pendulum. We showed that our model could

learn non-trivial canonical momenta on a task where Hamiltonian learning struggles. Finally, we

demonstrated a graph version of the model with the 1D wave equation.

Acknowledgments

Miles Cranmer would like to thank Jeremy Goodman for several discussions and

validation of the theory behind the Lagrangian approach, and Adam Burrows and Oliver Philcox for

very helpful comments on a draft of the paper.

REFERENCES

Anton Bakhtin, Laurens van der Maaten, Justin Johnson, Laura Gustafson, and Ross Girshick. Phyre:

A new benchmark for physical reasoning. In Advances in Neural Information Processing Systems,

pp. 5083–5094, 2019.

Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks

for learning about objects, relations and physics. In Advances in neural information processing

systems, pp. 4502–4510, 2016.

6

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclau-

rin, and Skye Wanderman-Milne. JAX: composable transformations of Python+NumPy programs,

2018. URL http://github.com/google/jax.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neu-

ral Ordinary Differential Equations. Advances in neural information process-

ing systems, pp. 6571–6583, 2018. URL

http://papers.nips.cc/paper/

7892-neural-ordinary-differential-equations.pdf.

Zhengdao Chen, Jianyu Zhang, Martin Arjovsky, and Léon Bottou. Symplectic recurrent neural

networks. arXiv preprint arXiv:1909.13334, 2019.

Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel,

and Shirley Ho. Discovering Physical Equations with Graph Networks. forthcoming, 2020.

Xavier Glorot and Yoshua Bengio. Understanding the difﬁculty of training deep feedforward neural

networks. In Proceedings of the thirteenth international conference on artiﬁcial intelligence and

statistics, pp. 249–256, 2010.

Samuel Greydanus, Misko Dzamba, and Jason Yosinski. Hamiltonian Neural Networks. In Advances

in Neural Information Processing Systems, pp. 15353–15363, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectiﬁers: Surpassing

human-level performance on imagenet classiﬁcation. In Proceedings of the IEEE international

conference on computer vision, pp. 1026–1034, 2015.

Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton. Imagenet classiﬁcation with deep convolutional

neural networks. In Advances in Neural Information Processing Systems 25, pp. 1106–1114, 2012.

Michael Lutter, Christian Ritter, and Jan Peters. Deep Lagrangian Networks: Using physics as model

prior for deep learning. arXiv preprint arXiv:1907.04490, 2019.

Matthias Rupp, Alexandre Tkatchenko, Klaus-Robert Müller, and O Anatole Von Lilienfeld. Fast

and accurate modeling of molecular atomization energies with machine learning. Physical review

letters, 108(5):058301, 2012.

Alvaro Sanchez-Gonzalez, Victor Bapst, Kyle Cranmer, and Peter Battaglia. Hamiltonian graph

networks with ode integrators. arXiv preprint arXiv:1909.12790, 2019.

Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The

graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.

M. Schmidt and H. Lipson. Distilling free-form natural laws from experimental data. science, 324

(5923):81–85, 2009.

Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko.

Quantum-chemical insights from deep tensor neural networks. Nature communications, 8:13890,

2017.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,

Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without

human knowledge. Nature, 550(7676):354, 2017.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks.

CoRR, abs/1409.3215, 2014. URL http://arxiv.org/abs/1409.3215.

Peter Toth, Danilo Jimenez Rezende, Andrew Jaegle, Sébastien Racanière, Aleksandar Botev, and

Irina Higgins. Hamiltonian Generative Networks. International Conference on Machine Learning,

2019.

7

Appendix

A SOLVING EULER-LAGRANGE WITH JAX

Here we will present a simple JAX implementation of Equation 6. Assume that

lagrangian

is a

differentiable function that takes three vectors as input and outputs a scalar. Meanwhile,

q_t

is a

vector of the velocities (

˙q

) of

q

(

q

) and

q_tt

contains the accelerations (

¨q

). The vector

m

represents

non-dynamical parameters.

q_tt = (

jax.numpy.linalg.pinv(jax.hessian(lagrangian, 1)(q, q_t, m)) @ (

jax.grad(lagrangian, 0)(q, q_t, m)

- jax.jacobian(jax.jacobian(lagrangian, 1), 0)(q, q_t, m) @ q_t

)

)

When this is called in a loss function with the LNN parameters as input:

loss(params, ...)

,

one can write jax.grad(loss, 0)(params, ...), to get the gradient.

B EXAMPLE OF LAGRANGIAN FORWARD MODEL

To demonstrate how this may work if one has learned the exact function for

L

, let us study an example

of a ball falling in a gravity g along the direction q

1

:

L =

1

2

m

˙q

2

1

+ ˙q

2

2

− mgq

1

, (9)

where

g

is the local scalar gravitational ﬁeld, and

m

is the mass of the ball. To obtain the dynamics

with our forward model, we calculate the required derivatives which gives us:

∇

˙q

∇

>

˙q

L =

m 0

0 m

, (10)

∇

q

∇

>

˙q

L = 0, and (11)

L

q

=

−mg

0

. (12)

Thus, we ﬁnd

¨q

1

¨q

2

= (∇

˙q

∇

>

˙q

L)

−1

∇

q

L − (∇

q

∇

>

˙q

L) ˙q

(13)

=

m 0

0 m

−1

−mg

0

(14)

=

−g

0

(15)

which simply says that the particle accelerates downwards along

q

1

, and moves at constant velocity

along q

2

.

C INITIALIZATION

Regular optimization techniques such as Kaiming (He et al., 2015) and Xavier (Glorot & Bengio,

2010) initialization are optimized so that in a regular neural network, the gradients of the output

with respect to each parameter will have a mean of zero and standard deviation of one. Since the

unusual optimization objective of a Lagrangian Neural Network is very nonlinear with respect to its

parameters, we found that classical initialization schemes were insufﬁcient.

To ﬁnd a better initialization scheme, we conducted an empirical optimization of the KL-divergence

of the gradient of each parameter with respect to a univariate Gaussian. We repeated this over a

variety of neural network depths and widths and ﬁt an empirical formula to our results.

8

To do this, we ran 2500 optimization steps with different initialization variances on each layer for an

MLP of ﬁxed depth and width. Biases were always initialized to zero. We recorded the optimized

σ

values over

∼

200 random hyperparameter settings with the number of hidden nodes between 50

and 300 and the number of hidden layers between one and three. Then, we used eureqa (Schmidt &

Lipson, 2009) to ﬁnd ﬁt an equation using symbolic regression that predicted the optimal initialization

variance as a function of the hyperparameters:

σ =

1

√

n

(

2.2 First layer

0.58i Hidden layer i ∈ {1, . . .}

n, Output layer,

(16)

This model was optimized for 2 input coordinates and 2 input coordinate velocities which were

sampled from univariate Gaussians.

During training, the hidden weight matrices had dimensions

n × n

and each weight matrix

was sampled from

N(0, σ

2

)

. A 100-node 4-layer model would have weight matrices with

shapes

{(4, 100), (100, 100), (100, 100), (100, 1)}

and each one had initializations sampled from

{N(0, 0.22), N(0, 0.058), N(0, 0.116), N(0, 10)}, respectively.

9