Recently Microsoft announced ML.NET, a machine learning framework for .NET. This is exciting news, so my mind immediately goes to: how does this look with F#? This post takes a look at using ML.NET's regression module to predict concrete compressive strength based on its component ingredients.
Update: This post remains here for posterity's sake; a rework of this post using ML.NET version 1.3 is here.
Before jumping in too far, there is a disclaimer: ML.NET is in its early stages. I found a couple of implementation and interface idiosyncrasies that I suspect will change over time. Just keep that in mind moving forward. The short version is that I've been pleased with what I've seen so far. There is some room for improvement, especially in having more F#-centric support for calling methods. It will be an interesting journey as the framework matures.
Update: The post was written using Microsoft.ML v0.1.0, and v0.2.0 has since been released. I have noted the interface changes below; for this example, the only change is TextLoader.
With that out of the way, make sure you have .NET Core version 2.0 installed. If you don't, head to the .NET Core Downloads page and select the SDK for your platform. Tangentially, you can also get there by going to dot.net, then navigating to Downloads and .NET Core.
First, create the project and add the ML.NET package. This will be a console app in F# (obviously).
dotnet new console --language F# --name MLNet-Concrete
dotnet add package Microsoft.ML   # add the ML.NET NuGet package
Next, it is time to get the data. The source I used for this post is from UCI. The dataset is an Excel file (xls), and I need it as a csv. I used ssconvert (from apt install gnumeric) to convert from Excel to CSV, but feel free to use whatever works for you.
mkdir data && cd data
Here is a sample of what the data looks like. There is a header row; I've transposed it to a vertical list for readability. The first 8 columns are features, and the last is the concrete compressive strength.
# Header Row
# Data Rows
Now that the project is set up and the data is local, we can get to the code. Time to open up the already-created Program.fs. First, add the necessary namespaces.
open System
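For reference, here is a sketch of the additional namespaces I would expect this example to need with the 0.x LearningPipeline API; the exact set is an assumption on my part and may differ from the original listing.

```fsharp
// Assumed namespaces for the 0.x LearningPipeline API (in addition to System)
open Microsoft.ML                 // LearningPipeline
open Microsoft.ML.Data            // TextLoader (v0.2.0)
open Microsoft.ML.Runtime.Api     // Column attribute
open Microsoft.ML.Trainers        // FastTreeRegressor and other regressors
open Microsoft.ML.Transforms     // ColumnConcatenator, CategoricalOneHotVectorizer
open Microsoft.ML.Models          // RegressionEvaluator
```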
The ML.NET pipeline expects the data in a specific format. In the C# world this is a class; for F# we can use a type. Below are the required types: ConcreteData is the input data, and ConcretePrediction is the output prediction. ConcreteData is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data. Each attribute must be mutable public, requires a [<Column("#")>] attribute to specify its column position, and needs a [<DefaultValue>] attribute. For ConcretePrediction, a single attribute is required: the prediction value. For the input data, the label variable must be named Label. For the prediction type, the variable must be named Score. There are supposed to be ways to define a ColumnName attribute or to copy a label column into the pipeline, but frankly they didn't work for me. I'm unclear whether I was doing something wrong or if it's an early-stage problem. Over time I expect this will be resolved, but for now I don't mind working within tighter constraints.
type ConcreteData() =
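As a sketch, the two types might look something like the following. I'm using field names based on the UCI dataset's eight features (cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and age), so the exact names and column order here are assumptions.

```fsharp
type ConcreteData() =
    [<Column("0")>] [<DefaultValue>] val mutable public Cement : float32
    [<Column("1")>] [<DefaultValue>] val mutable public Slag : float32
    [<Column("2")>] [<DefaultValue>] val mutable public Ash : float32
    [<Column("3")>] [<DefaultValue>] val mutable public Water : float32
    [<Column("4")>] [<DefaultValue>] val mutable public Superplasticizer : float32
    [<Column("5")>] [<DefaultValue>] val mutable public CoarseAggregate : float32
    [<Column("6")>] [<DefaultValue>] val mutable public FineAggregate : float32
    [<Column("7")>] [<DefaultValue>] val mutable public Age : float32
    // The target column must be named Label
    [<Column("8")>] [<DefaultValue>] val mutable public Label : float32

type ConcretePrediction() =
    // The prediction value must be named Score
    [<DefaultValue>] val mutable public Score : float32
```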
The structure of building a pipeline is pretty intuitive. First, create a pipeline. Then add components to the pipeline in the order they are to be executed. So first, load the data with a TextLoader. This data is comma delimited and has a header row.
let pipeline = new LearningPipeline()
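As a sketch, the loading step might look like this; the file path and the parameter names (useHeader, separator) are assumptions based on the v0.2.0-era samples.

```fsharp
// v0.2.0 style: construct a TextLoader, then describe the shape with CreateFrom
pipeline.Add(TextLoader("data/concrete.csv").CreateFrom<ConcreteData>(useHeader = true, separator = ','))

// v0.1.0 style was roughly:
// pipeline.Add(TextLoader<ConcreteData>("data/concrete.csv", useHeader = true, separator = ","))
```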
After the data is loaded, feature columns need to be added to the pipeline. I'm going to use all of the feature columns from the file, but I don't have to. The regressor model requires features to be numeric. In this example that is already the case, so nothing special needs to be done. In cases where columns are strings, the CategoricalOneHotVectorizer() will convert string columns to numeric mappings. I've provided an example line below; even though I don't need it, it's a handy reference to have. Note the order: since it is a pipeline, the string-to-numeric column conversion needs to happen before adding the feature columns.
// Example how to convert text to numeric
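As a sketch, those two pieces might look like this; the column names follow the ConcreteData sketch above and are assumptions.

```fsharp
// Not needed for this dataset: convert a hypothetical string column to numeric first
// pipeline.Add(CategoricalOneHotVectorizer("SomeTextColumn"))

// Gather the numeric feature columns into the single "Features" vector the trainer expects
pipeline.Add(ColumnConcatenator("Features",
                                "Cement", "Slag", "Ash", "Water",
                                "Superplasticizer", "CoarseAggregate",
                                "FineAggregate", "Age"))
```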
Now that the features are defined, it is time to determine what training method to use. For this post, FastTreeRegressor is used. This is a boosted decision tree and generally offers pretty good results. Custom hyperparameters can also be defined. I found the defaults to be fine, but it's good to have the option to tweak those values.
pipeline.Add(new FastTreeRegressor())
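For the hyperparameter variant referenced later, the general idea is that the trainer's settings are plain properties that can be set at construction. The property names and values below are assumptions from the 0.x API and may need adjusting.

```fsharp
// Hypothetical custom hyperparameters (property names and values assumed):
// pipeline.Add(FastTreeRegressor(NumTrees = 500, NumLeaves = 40, LearningRates = 0.1))
```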
For the dataset in question, the FastTreeRegressor worked the best, but there are alternatives; I've listed them below. Most had worse performance, with the FastTreeTweedieRegressor being similar. As with anything, it is good to investigate the options.
// Similar performance
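For reference, a sketch of that list based on the regression trainers I believe shipped in the 0.x packages (assumed names, non-exhaustive):

```fsharp
// Similar performance on this dataset
// pipeline.Add(FastTreeTweedieRegressor())

// Other regressors available in the 0.x packages (assumed names)
// pipeline.Add(FastForestRegressor())
// pipeline.Add(OnlineGradientDescentRegressor())
// pipeline.Add(PoissonRegressor())
// pipeline.Add(StochasticDualCoordinateAscentRegressor())
```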
The last part: train the model. Note the ConcreteData and ConcretePrediction types as part of the Train call.
let model = pipeline.Train<ConcreteData, ConcretePrediction>()
Validation of any model is important. For a real case, I would train on one dataset and validate against a previously unseen dataset. Since this is just an example, I validate against the training data. As a result, I expect the results to be very good, and they are. ML.NET offers an evaluator class, which makes getting those crucial high-level numbers pretty easy: it takes a trained model and a dataset and produces the key metrics. Again, this is one of those components that an ML framework needs, and I'm glad to see it here.
// Evaluate results
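As a sketch, the evaluation step might look like the following; reusing the training file as the test set matches the approach described above, and the file path and parameter names are assumptions.

```fsharp
// Load the same file again as "test" data (the training data, as noted above)
let testData = TextLoader("data/concrete.csv").CreateFrom<ConcreteData>(useHeader = true, separator = ',')

let evaluator = RegressionEvaluator()
let metrics = evaluator.Evaluate(model, testData)

printfn "Rms:      %f" metrics.Rms
printfn "RSquared: %f" metrics.RSquared
```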
# Evaluator Results:
Backtracking to the hyperparameter example, here are those results. As you can tell, my randomly picked hyperparameter choices were not better. It certainly seems like a fun opportunity to pair some optimization searches with the pipeline to see how the methods can be improved. Of course, this would be more meaningful if it were not validating against the training data; there is already a risk of overfitting that we're not seeing.
# Evaluator Results (with hyperparameters):
Here is an example of how individual predictions can be made. Create a ConcreteData object and provide it to the Predict method. For this example, I pull one of the rows from the training data.
let test1 = ConcreteData()
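Continuing from the test1 above, a sketch of filling in the fields and calling Predict might look like this. The field values are placeholders rather than the actual row used here, and the field names follow the ConcreteData sketch earlier.

```fsharp
// Fill in the feature values (placeholders, not the actual training row)
test1.Cement           <- 540.0f
test1.Slag             <- 0.0f
test1.Ash              <- 0.0f
test1.Water            <- 162.0f
test1.Superplasticizer <- 2.5f
test1.CoarseAggregate  <- 1040.0f
test1.FineAggregate    <- 676.0f
test1.Age              <- 28.0f

// Run the single prediction; the result is in the Score field
let prediction = model.Predict(test1)
printfn "Predicted compressive strength: %f" prediction.Score
```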
# Prediction Result:
On a lark, let's see what happens if the slag is increased and the water content is reduced. It looks like the compressive strength increases.
let test2 = ConcreteData()
# Prediction Result:
Once a model is trained, it can also be saved to a file and reloaded at a later time. This is supported by the WriteAsync and ReadAsync methods of a model.
// Save model to file
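As a sketch, the save/reload round trip might look like this; the file name is an assumption.

```fsharp
// Save the trained model to disk (file name assumed)
model.WriteAsync("concrete-model.zip") |> Async.AwaitTask |> Async.RunSynchronously

// Reload it later and predict with the reloaded model
let reloadedModel =
    PredictionModel.ReadAsync<ConcreteData, ConcretePrediction>("concrete-model.zip")
    |> Async.AwaitTask
    |> Async.RunSynchronously

let prediction2 = reloadedModel.Predict(test1)
printfn "Prediction (model reloaded): %f" prediction2.Score
```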
# Prediction Result (model reloaded):
Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with dotnet run.
Not adding a normalizer.
There you have it: a brief look into training and using an ML.NET regressor model. Although there are a couple of quirks, I'm excited to see this released. It will only get better over time, and if F# can be a part of that, even better.