
# SymbolicRegression.jl

**Discover Mathematical Laws from Data**: a flexible, user-friendly framework that automatically finds interpretable equations from your data.

SymbolicRegression.jl searches for symbolic expressions which optimize a particular objective.

Check out PySR for a Python frontend.

## Quickstart

Install in Julia with:

```julia
using Pkg
Pkg.add("SymbolicRegression")
```

## MLJ Interface

The easiest way to use SymbolicRegression.jl is with MLJ. Let's see an example:

```julia
import SymbolicRegression: SRRegressor
import MLJ: machine, fit!, predict, report

# Dataset with two named features:
X = (a = rand(500), b = rand(500))

# and one target:
y = @. 2 * cos(X.a * 23.5) - X.b ^ 2

# with some noise:
y = y .+ randn(500) .* 1e-3

model = SRRegressor(
    niterations=50,
    binary_operators=[+, -, *],
    unary_operators=[cos],
)
```

Now, let's create and train this model on our data:

```julia
mach = machine(model, X, y)

fit!(mach)
```

You will notice that expressions are printed using the column names of our table. If, instead of a table-like object, a simple array is passed (e.g., `X = randn(100, 2)`), the variable names will default to `x1, ..., xn`.
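For instance, a minimal sketch of the matrix case, reusing the `model` defined above (here `y` is unrelated to the random matrix, so this only demonstrates the naming behavior):

```julia
# Passing a plain matrix instead of a table; the discovered
# expressions will print with x1 and x2 as variable names.
X_matrix = randn(500, 2)
mach_matrix = machine(model, X_matrix, y)
fit!(mach_matrix)
```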

Let's look at the expressions discovered:

```julia
report(mach)
```
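The report contains the full Pareto front. For example, assuming the report fields `best_idx` and `equations` documented for `SRRegressor` (check the docstring of your installed version), you can pull out the automatically selected expression:

```julia
r = report(mach)

# `best_idx` is the index chosen by the model's selection method;
# `equations` holds the Pareto-front expressions.
best_equation = r.equations[r.best_idx]
```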

Finally, we can make predictions with the expressions on new data:

```julia
predict(mach, X)
```

This will make predictions using the expression selected by `model.selection_method`, which by default is a mix of accuracy and complexity.

You can override this selection and select an equation from the Pareto front manually with:

```julia
predict(mach, (data=X, idx=2))
```

which here evaluates the second equation on the Pareto front.
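You can also change the default selection when constructing the model by passing your own `selection_method`. A hedged sketch follows; the keyword names mirror the documented `choose_best` example, so verify them against your installed version:

```julia
# Hypothetical selection rule: always pick the most accurate
# equation, ignoring complexity entirely.
my_selection(; trees, losses, scores, complexities) = argmin(losses)

model_most_accurate = SRRegressor(
    niterations=50,
    binary_operators=[+, -, *],
    unary_operators=[cos],
    selection_method=my_selection,
)
```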

For fitting multiple outputs, one can use `MultitargetSRRegressor` (and pass an array of indices to `idx` in `predict` for selecting specific equations). For a full list of options available to each regressor, see the API page.
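A minimal sketch of the multi-target workflow; the second target column is invented here for illustration:

```julia
import SymbolicRegression: MultitargetSRRegressor

# Two targets over the same features (`y2` is made up for this example):
Y = (y1 = y, y2 = @. X.a^2 - X.b)

multi_model = MultitargetSRRegressor(
    niterations=50,
    binary_operators=[+, -, *],
    unary_operators=[cos],
)
multi_mach = machine(multi_model, X, Y)
fit!(multi_mach)

# When selecting equations manually, pass one index per target:
predict(multi_mach, (data=X, idx=[2, 2]))
```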

## Low-Level Interface

The heart of SymbolicRegression.jl is the `equation_search` function. This takes a 2D array and attempts to model a 1D array using analytic functional forms. Note: unlike the MLJ interface, this assumes column-major input of shape `[features, rows]`.

```julia
import SymbolicRegression: Options, equation_search

X = randn(2, 100)
y = 2 * cos.(X[2, :]) + X[1, :] .^ 2 .- 2

options = Options(
    binary_operators=[+, *, /, -],
    unary_operators=[cos, exp],
    populations=20
)

hall_of_fame = equation_search(
    X, y, niterations=40, options=options,
    parallelism=:multithreading
)
```

You can view the resultant equations in the dominating Pareto front (best expression seen at each complexity) with:

```julia
import SymbolicRegression: calculate_pareto_frontier

dominating = calculate_pareto_frontier(hall_of_fame)
```

This is a vector of `PopMember`; each member contains the expression along with its cost. We can get the expressions with:

```julia
trees = [member.tree for member in dominating]
```

Each of these equations is an `Expression{T}` type for some constant type `T` (like `Float32`).

These expression objects are callable – you can simply pass in data:

```julia
tree = trees[end]
output = tree(X)
```

## Constructing expressions

Expressions are represented under the hood as the `Node` type, which is developed in the DynamicExpressions.jl package. The `Expression` type wraps this and includes metadata about operators and variable names.

You can manipulate and construct expressions directly. For example:

```julia
using SymbolicRegression: Options, Expression, Node

options = Options(;
    binary_operators=[+, -, *, /], unary_operators=[cos, exp, sin]
)
operators = options.operators
variable_names = ["x1", "x2", "x3"]
x1, x2, x3 = [Expression(Node(Float64; feature=i); operators, variable_names) for i=1:3]

tree = cos(x1 - 3.2 * x2) - x1 * x1
```

This tree has `Float64` constants, so the type of the entire tree will be promoted to `Node{Float64}`.

We can convert all constants (recursively) to Float32:

```julia
float32_tree = convert(Expression{Float32}, tree)
```

We can then evaluate the converted tree on a `Float32` dataset:

```julia
X = rand(Float32, 3, 100)

float32_tree(X)
```

This callable format is the easy-to-use version, which automatically sets all outputs to `NaN` if any `Inf` or `NaN` was encountered during evaluation. You can call the raw evaluation method with `eval_tree_array`:

```julia
using SymbolicRegression: eval_tree_array

output, did_succeed = eval_tree_array(float32_tree, X)
```

where `did_succeed` explicitly declares whether the evaluation was successful.
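For example, a minimal guard on this flag (the printed messages are only illustrative):

```julia
if did_succeed
    println("First few outputs: ", output[1:5])
else
    println("Evaluation hit Inf or NaN; the outputs should be discarded.")
end
```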

## Exporting to SymbolicUtils.jl

We can view the equations in the dominating Pareto frontier with:

```julia
dominating = calculate_pareto_frontier(hall_of_fame)
```

We can convert the best equation to SymbolicUtils.jl with the following function:

```julia
import SymbolicRegression: node_to_symbolic
import SymbolicUtils: simplify

eqn = node_to_symbolic(dominating[end].tree)
println(simplify(eqn*5 + 3))
```

We can also print out the full Pareto frontier like so:

```julia
import SymbolicRegression: compute_complexity, string_tree

println("Complexity\tMSE\tEquation")

for member in dominating
    complexity = compute_complexity(member, options)
    loss = member.loss
    string = string_tree(member.tree, options)

    println("$(complexity)\t$(loss)\t$(string)")
end
```
