With the release of v0.7.0, it is time to revisit K-means clustering using F# and Microsoft's new ML.NET framework. The API has changed enough to warrant a minor rework. This post is a re-examination of a previous post, F# and ML.NET Clustering. The use case is to classify mammogram results based on examination attributes.
Note: ML.NET is still evolving, this post was written using Microsoft.ML v0.7.0.
Make sure you have .NET Core version 2.1 installed. If you don't, head out to the .NET Core Downloads page and select the SDK for your platform. Tangentially, you can also get there by going to dot.net, then navigating to Downloads and .NET Core.
First, create a console F# project, then add the ML.NET package.
```bash
dotnet new console --language F# --name MLNet-Mammogram
```
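Then, from inside the new project directory, add the Microsoft.ML package, pinning it to the version this post was written against:

```bash
cd MLNet-Mammogram
dotnet add package Microsoft.ML --version 0.7.0
```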
Next, it is time to get the data. The source I used for this post is from UCI. The data file can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data).
```bash
mkdir data && cd data
```
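One way to pull the file down into the new data directory (curl shown here; wget or a browser download works just as well):

```bash
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data
```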
Here is a sample of what the data looks like. There is no header row. The columns represent 5 features and 1 classification column:
- BI-RADS assessment (1-5)
- Age
- Shape
- Margin
- Density
- Severity (benign or malignant)

```
# Data Rows
```
Now that the project is set up and the data is local, we can get to the code. Time to open up the already created `Program.fs`. First, add the necessary namespaces.
```fsharp
open System
```
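For reference, the full set of opens I ended up needing looks roughly like the sketch below. The `Microsoft.ML.Runtime.*` locations are what I recall being required for v0.7.0 (for the `Column` attribute and `TextLoader`), so treat them as assumptions rather than gospel:

```fsharp
open System
open System.IO                    // file streams for saving/loading the model
open Microsoft.ML                 // MLContext and the estimator/transformer APIs
open Microsoft.ML.Runtime.Api     // [<Column>] attribute (v0.7.0 location, assumed)
open Microsoft.ML.Runtime.Data    // TextLoader and DataKind (v0.7.0 location, assumed)
```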
The ML.NET pipeline expects the data in a specific format. In the C# world this is a class; in F# we can use a type. Below are the required types: `MammogramData` is the input data and `MammogramPrediction` is the output prediction. `MammogramData` is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data: each attribute must be `mutable public`, and it requires a `[<Column("#")>]` attribute to specify its column position along with a `[<DefaultValue>]` attribute. `MammogramPrediction` needs `PredictedLabel` for the cluster id and `Score` for the calculated distances to all clusters.
```fsharp
type MammogramPrediction() =
```
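Filled out, the two types look something like this sketch. The column indexes follow the data file layout described above; the field types and attribute usage are my reading of what v0.7.0 expects:

```fsharp
// Input row: one record per examination. Column indexes match the data file.
type MammogramData() =
    [<Column("0")>] [<DefaultValue>] val mutable public BiRads   : float32
    [<Column("1")>] [<DefaultValue>] val mutable public Age      : float32
    [<Column("2")>] [<DefaultValue>] val mutable public Shape    : float32
    [<Column("3")>] [<DefaultValue>] val mutable public Margin   : float32
    [<Column("4")>] [<DefaultValue>] val mutable public Density  : float32
    [<Column("5")>] [<DefaultValue>] val mutable public Severity : float32

// Output row: PredictedLabel is the cluster id, Score holds the distance
// to each cluster centroid.
type MammogramPrediction() =
    [<DefaultValue>] val mutable public PredictedLabel : uint32
    [<DefaultValue>] val mutable public Score : float32[]
```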
Here is one of the big changes from early versions: the pipeline object is gone, replaced with an `MLContext`. Although different, it is still intuitive and gains additional functionality. First, create an `MLContext`; if desired, a `seed` can be defined to ensure the same results between executions.
```fsharp
let mlContext = MLContext()
```
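If you want reproducible runs, construct the context with a seed instead; wrapping it in `Nullable` is how I would expect the optional parameter to surface from F#:

```fsharp
// A fixed seed makes the cluster assignments repeatable between runs
let mlContext = MLContext(seed = Nullable 1)
```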
Time to load the data. This is another area that has changed since earlier versions. First create a `TextReader` with a file format definition, then use that object to read the data from the data file. The entire file could be used for training; alternatively, `TrainTestSplit` (another new function) can be used to easily divide a single dataset into train and test sets. This is especially handy during the development process.
```fsharp
let dataPath = "./data/mammographic_masses.data"
```
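A rough sketch of the reader definition and the split is below. The `TextReader` argument names and the `TrainTestSplit` signature are my best recollection of the v0.7.0 surface area, so treat them as assumptions:

```fsharp
// Describe the file: comma separated, no header, six single-precision columns
let args = TextLoader.Arguments()
args.Separator <- ","
args.HasHeader <- false
args.Column <-
    [| TextLoader.Column("BiRads",   Nullable DataKind.R4, 0)
       TextLoader.Column("Age",      Nullable DataKind.R4, 1)
       TextLoader.Column("Shape",    Nullable DataKind.R4, 2)
       TextLoader.Column("Margin",   Nullable DataKind.R4, 3)
       TextLoader.Column("Density",  Nullable DataKind.R4, 4)
       TextLoader.Column("Severity", Nullable DataKind.R4, 5) |]

let textLoader = mlContext.Data.TextReader(args)   // assumed v0.7.0 factory method
let allData = textLoader.Read(dataPath)

// Hold back 20% of the rows for validation (returns a struct tuple)
let struct (trainData, testData) =
    mlContext.Clustering.TrainTestSplit(allData, testFraction = 0.2)
```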
After the data is loaded, feature columns need to be added to the transforms. I'm going to use all feature columns from the file and exclude severity. The clustering model requires features to be numeric, which is fine here. As the other posts show, you can convert text to numeric mappings if necessary.
```fsharp
let dataProcessPipeline = mlContext.Transforms.Concatenate("Features", "BiRads", "Age", "Shape", "Margin", "Density")
```
Now that the features are defined, it is time to create a model. This will be `KMeans`. Similar to the other trainers, custom parameters can be defined; I have decided to use `K = 4`. It also has other options such as `MaxIterations`, `OptTol` (convergence tolerance), and `NormalizeFeatures`. The KMeans trainer/estimator must be combined with the training data to create a model. The last part is to create a prediction function from the model. Note the `MammogramData` and `MammogramPrediction` types as part of the call.
```fsharp
let trainer = mlContext.Clustering.Trainers.KMeans(features = "Features", clustersCount = 4)
```
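The fit and prediction-function steps might look roughly like this; `MakePredictionFunction` is the name I recall the v0.7.0 API using, so take it as an assumption:

```fsharp
// Chain the feature transform and the trainer, then fit against the training rows
let trainingPipeline = dataProcessPipeline.Append(trainer)
let model = trainingPipeline.Fit(trainData)

// Wrap the fitted model in a strongly typed single-row prediction function
let predictionFunction =
    model.MakePredictionFunction<MammogramData, MammogramPrediction>(mlContext)
```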
Validation of any model is important. With the data split into train and test sets, it is easy to get metrics against the training data and then validate against the previously unseen test data.
```fsharp
// Evaluate results (train)
```
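A sketch of that evaluation step; the metric property name (`AvgMinScore`, the average distance to the nearest centroid) is my recollection of the v0.7.0 clustering metrics:

```fsharp
// Score each split with the fitted model and ask the clustering context for metrics
let trainMetrics = mlContext.Clustering.Evaluate(model.Transform(trainData))
let testMetrics  = mlContext.Clustering.Evaluate(model.Transform(testData))

printfn "Train Data:"
printfn "  Average min score (distance to centroid): %f" trainMetrics.AvgMinScore
printfn "Test Data:"
printfn "  Average min score (distance to centroid): %f" testMetrics.AvgMinScore
```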
```
Train Data:
```
With the initial evaluation out of the way, it is time to move on to individual predictions. I want to create aggregate classification percentages for each cluster. To do this I take the predictive model and apply it against the training file. Using the predicted cluster and the training label, I create a mapping for detailed predictions. Each cluster gets its own raw benign/malignant count, which can be converted into a percentage likelihood for each classification. I have the details annotated in comments to make it easier to follow. Honestly, this is the most labor-intensive part of the process. I'd love to be able to pass a cluster-aggregate-score function in as part of the trainer to eliminate this work of reprocessing the data. Once I have these results as a `Map`, I can query results easily enough.
```fsharp
// Create classifications by cluster
```
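A sketch of how that aggregation could look. The file parsing here (splitting on commas and skipping rows with missing "?" values) is my own simplification rather than the original code, but the shape of the result is the same: a `Map` from cluster id to benign/malignant percentages:

```fsharp
// Run every training row back through the prediction function, tally
// benign/malignant counts per cluster, then convert the counts to percentages.
let clusterIdToPrediction =
    File.ReadAllLines(dataPath)
    |> Array.filter (fun line -> not (line.Contains("?")))   // drop rows with missing values
    |> Array.map (fun line ->
        let cols = line.Split(',') |> Array.map float32
        let row = MammogramData()
        row.BiRads   <- cols.[0]
        row.Age      <- cols.[1]
        row.Shape    <- cols.[2]
        row.Margin   <- cols.[3]
        row.Density  <- cols.[4]
        row.Severity <- cols.[5]
        predictionFunction.Predict(row).PredictedLabel, cols.[5])
    |> Array.groupBy fst
    |> Array.map (fun (clusterId, rows) ->
        let total     = float rows.Length
        let malignant = rows |> Array.filter (fun (_, severity) -> severity = 1.0f)
                             |> Array.length |> float
        // (benign %, malignant %) for this cluster
        clusterId, (1.0 - malignant / total, malignant / total))
    |> Map.ofArray
```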
Now that `clusterIdToPrediction` is defined, I can pair the ML.NET cluster prediction with the aggregated cluster classification percentages. First, create a `MammogramData` object and provide it to the `Predict` method. Second, use the predicted cluster id with the aggregated cluster classification percentages to get a classification result. For this example, I pull one of those rows from the training data.
```fsharp
// Prediction
```
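A sketch of that lookup; the attribute values below are placeholders standing in for the actual training row used in the post:

```fsharp
// Build a single examination record (placeholder values), predict its cluster,
// then look up the aggregated percentages for that cluster.
let sample = MammogramData()
sample.BiRads  <- 5.0f
sample.Age     <- 67.0f
sample.Shape   <- 3.0f
sample.Margin  <- 5.0f
sample.Density <- 3.0f

let prediction = predictionFunction.Predict(sample)
let benignPct, malignantPct = clusterIdToPrediction.[prediction.PredictedLabel]

printfn "# Prediction Result:"
printfn "Cluster: %d, Benign: %.0f%%, Malignant: %.0f%%"
    prediction.PredictedLabel (benignPct * 100.0) (malignantPct * 100.0)
```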
The results show the prediction falls into cluster 1, which has an 80% likelihood of being malignant; this matches the actual value.
```
# Prediction Result:
```
Once a model has been created, it is often useful to save it for later use. The save method has changed from previous versions. Once saved, the model can be loaded for future use.
```fsharp
// Save model to file
```
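A rough sketch of the save-and-reload round trip. The `mlContext.Model.Save`/`Load` members shown here are an assumption on my part for v0.7.0, so verify them against the installed package:

```fsharp
let modelPath = "./MammogramClusterModel.zip"   // hypothetical output path

// Persist the fitted model (save API shape assumed for v0.7.0)
do
    use stream = File.Create(modelPath)
    mlContext.Model.Save(model, stream)

// Reload it and rebuild the prediction function from the loaded transformer
let reloadedModel =
    use stream = File.OpenRead(modelPath)
    mlContext.Model.Load(stream)

let reloadedPredictionFunction =
    reloadedModel.MakePredictionFunction<MammogramData, MammogramPrediction>(mlContext)
```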
As expected, the prediction results are the same with the reloaded model.
```
# Prediction Result: (model reloaded):
```
Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with `dotnet run`.
```
Train Data:
```
This has been a brief look into training and using an ML.NET k-means cluster model. As seen with the other models, ML.NET is providing a nice consistent interface and has some good components. It is a framework that continues to grow in a positive direction. Kudos and thanks to all the people making this a reality. That’s all for now. Until next time.