2017-01-12

Linear Regression and F#

Read Time: 8 minutes

Today I look into performing linear regression using F#. The implementations of interest will be the MathNet and Accord.NET libraries. I assume you already know what linear regression is, but in can you need a refresher: Linear Regression. My goal is to provide a simple explanation of how to leverage some existing F# accessible libraries. Once you know some of the basic calling functions, you can go crazy with some of the other options these libraries have to offer.

Using Paket, here is a sample paket.dependencies file.

source https://nuget.org/api/v2
nuget Accord
nuget Accord.Math
nuget Accord.Statistics
nuget Accord.MachineLearning
nuget FSharp.Charting
nuget MathNet.Numerics
nuget MathNet.Numerics.FSharp

Here is the library loading.

#r "../packages/FSharp.Charting/lib/net40/FSharp.Charting.dll"
#r "../packages/Accord/lib/net45/Accord.dll"
#r "../packages/Accord.Math/lib/net45/Accord.Math.dll"
#r "../packages/Accord.Statistics/lib/net45/Accord.Statistics.dll"
#r "../packages/MathNet.Numerics/lib/net40/MathNet.Numerics.dll"
#r "../packages/MathNet.Numerics.FSharp/lib/net40/MathNet.Numerics.FSharp.dll"

open System
open System.IO
open FSharp.Charting
open Accord
open Accord.Math
open Accord.Math.Distances
open Accord.Statistics
open Accord.Statistics.Models.Regression.Linear
open MathNet
open MathNet.Numerics
open MathNet.Numerics.LinearAlgebra
open MathNet.Numerics.LinearRegression

First I need to create some data. For this example, the formula is y = x^2 + noise + occasional outlier. The method is create 3 arrays representing x, noise, and outliers. It is a bit convoluted, but it allows me to show off a couple bits of functionality from F# and MathNet. The MathNet library includes methods to generate datasets, for the noise dataset Generate.Normal creates an array of numbers with a normal distribution. It is worth checking out the other diverse generation capabilities available. For outliers, I define an arbitrary 30% chance of a big spike in the data, as defined by pct and range variables. Then I use |||> and Array.zip3 to combine the 3 element tuple of arrays into an array of 3 element tuples. Once in this format, a map is used to calculate the formula mentioned at the start.

Sidenote: If you’ve coded any F#, you know |>. But did you know there are other, similar operators: ||> passes a tuple as two arguments, |||> passes a 3-tuple as 3 arguments.

let xData = [| 1. .. 0.25 .. 50. |]

// Args: length, mean, stddev
let noise = Generate.Normal(Array.length xData, 500., 200.) 

let makeOutliers len pct range =
    let rand = new Random()
    [|1..len|]
    |> Array.map (fun _ ->
        if rand.NextDouble() < pct
        then range * rand.NextDouble() - (range / 2.)
        else 0.)

let outliers = makeOutliers (Array.length xData) 0.3 2000. 

let yData = 
    (xData, noise, outliers) 
    |||> Array.zip3 
    |> Array.map (fun (x, n, o) -> x**2. + n + o)

MathNet

Starting with the MathNet implementation, it is time for the regression fitting. The first option is Fit.Line, this takes the x and y data, fits a line and returns the associated intercept and slope that can be plugged into a y = mx + b formula. The second option is to use Fit.LineFunc. It also takes the x and y data to fit a line. The difference is it creates a delegate function that can be used to directly calculate.

1
2
3

let mathnetIntercept, mathnetSlope = Fit.Line (xData, yData)

let regressionFunc = Fit.LineFunc(xData, yData)

Now it’s time to generate data based on the regression result. pData1 is calculated manually using the slope and intercept from the Fit.Line call. pData2 leverages the function delegate from Fit.LineFunc. I need to use .Invoke for performing the calculation.

let pData1 =
    xData
    |> Array.map (fun x -> mathnetSlope * x + mathnetIntercept)

let pData2 =
    xData
    |> Array.map (fun x -> regressionFunc.Invoke(x))

The below code combines 3 charts; the original data, plus the regression lines. Some of the effect is lost, since the lines are ontop of each other, but you hopefully get the point.

// Chart actual versus linear regression predictions
Chart.Combine([
    (xData, yData)  ||> Array.zip |> Chart.Point;
    (xData, pData1) ||> Array.zip |> Chart.Line;
    (xData, pData2) ||> Array.zip |> Chart.Line])
|> Chart.Show

Regression Comparison (MathNet)

Accord.NET

Now it is time to look at the Accord.NET implementation. Here,OrdinaryLeastSquares + Learn are used to determine line fitting. The result is a SimpleLinearRegression object.

let ols = new OrdinaryLeastSquares();
let accordRegression = ols.Learn(xData, yData)
let accordSlope = accordRegression.Slope
let accordIntercept = accordRegression.Intercept

Now it’s time to generate data based on the regression result. pData3 is calculated manually using the slope and intercept. pData4 is calculated directly by leveraging Accord’s Transform() function.

let pData3 =
    xData
    |> Array.map (fun x -> accordSlope * x + accordIntercept)

let pData4 =
    xData
    |> Array.map (fun x -> accordRegression.Transform(x))

Again, the below code combines 3 charts; the original data, plus the regression lines. As expected, these graphs are identical.

Chart.Combine([
    (xData, yData)  ||> Array.zip |> Chart.Point;
    (xData, pData3) ||> Array.zip |> Chart.Line;
    (xData, pData4) ||> Array.zip |> Chart.Line])
|> Chart.Show

Regression Comparison (Accord.NET)

So, there it is. A couple ways to do linear regression, but there is more. What good is the regression model if you can’t perform scoring against the result. Luckily, MathNet and Accord.NET have several methods of comparing datasets. There are too many options to show them all here, but here are a couple examples scoring predicted data (pData1) versus actual data (yData). For reference: MathNet Distances and Accord.NET Distances. I recommend digging deeper to find the scoring method appropriate for your specific needs.

MathNet:

printfn "Scoring: R2=%.2f R=%.2f PSE=%.2f SE=%.2f SAD=%.2f SSD=%.2f MAE=%.2f MSE=%.2f"
    (GoodnessOfFit.RSquared(pData1, yData))
    (GoodnessOfFit.R(pData1, yData))
    (GoodnessOfFit.PopulationStandardError(pData1, yData))
    (GoodnessOfFit.StandardError(pData1, yData, 2)) // 2 degrees freedom
    (Distance.SAD(pData1, yData))
    (Distance.SSD(pData1, yData))
    (Distance.MAE(pData1, yData))
    (Distance.MSE(pData1, yData))

Scoring: R2=0.75 R=0.87 PSE=427.23 SE=429.41 SAD=61878.78 SSD=35957245.48 MAE=314.11 MSE=182524.09

Accord.NET:

printfn "Scoring SE: %.2f E: %.2f E2:%.2f M:%.2f PC:%.2f"
    (accordRegression.GetStandardError(xData, yData))
    (Euclidean().Distance(pData1, yData))
    (SquareEuclidean().Distance(pData1, yData))
    (Manhattan().Distance(pData1, yData))
    (PearsonCorrelation().Similarity(pData1, yData))

Scoring SE: 429.41 E: 5996.44 E2:35957245.48 M:61878.78 PC:0.87

// Confidence Intervals
xData
|> Array.map (fun x -> 
    (x, accordRegression.GetConfidenceInterval(x, xData, yData)))
|> Array.iter (fun (x, ci) -> 
    printfn "x: %5.2f min: %7.2f max: %7.2f len:%7.2f" x ci.Min ci.Max ci.Length)

x:  1.00 min:  -90.80 max:  149.64 len: 240.44
x:  1.25 min:  -76.85 max:  161.76 len: 238.61
x:  1.50 min:  -62.90 max:  173.88 len: 236.78
x:  1.75 min:  -48.95 max:  186.00 len: 234.95
x:  2.00 min:  -35.01 max:  198.13 len: 233.14
...

I hope this has been helpful if you’re venturing into F# and regression.