With the release of v0.7.0, it is time to revisit K-means clustering using F# and Microsoft's new ML.NET framework. The API has changed enough to warrant a minor rework. This post is a re-examination of a previous post, F# and ML.NET Clustering. The use case is to classify mammogram results based on examination attributes.
Note: ML.NET is still evolving, this post was written using Microsoft.ML v0.7.0.
Make sure you have .NET Core version 2.1 installed. If you don't, head out to the .NET Core Downloads page and select the SDK for your platform. Tangentially, you can also get there by going to dot.net, then navigating to Downloads and .NET Core.
First, create a console F# project, then add the ML.NET package.
```bash
dotnet new console --language F# --name MLNet-Mammogram
```
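Then, from inside the new project directory, add the Microsoft.ML package, pinning it to the version this post was written against:

```bash
cd MLNet-Mammogram
dotnet add package Microsoft.ML --version 0.7.0
```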
Next, it is time to get the data. The source I used for this post is from UCI. The data file can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data).
```bash
mkdir data && cd data
```
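One way to pull the file down into the new data directory (curl shown here; wget or a browser download works just as well):

```bash
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data
```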
Here is a sample of what the data looks like. There is no header row. The columns represent 5 features and 1 classification column:
- BI-RADS assessment (1-5)
- Age
- Shape
- Margin
- Density
- Severity (benign or malignant)

```
# Data Rows
```
Now that the project is set up and the data is local, we can get to the code. Time to open up the already created `Program.fs`. First, add the necessary namespaces.
```fsharp
open System
```
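For reference, the full set of opens I ended up needing looks roughly like the sketch below. The `Microsoft.ML.Runtime.*` locations are what I recall being required for v0.7.0 (for the `Column` attribute and `TextLoader`), so treat them as assumptions rather than gospel:

```fsharp
open System
open System.IO                    // file streams for saving/loading the model
open Microsoft.ML                 // MLContext and the estimator/transformer APIs
open Microsoft.ML.Runtime.Api     // [<Column>] attribute (v0.7.0 location, assumed)
open Microsoft.ML.Runtime.Data    // TextLoader and DataKind (v0.7.0 location, assumed)
```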
The ML.NET pipeline expects the data in a specific format. In the C# world this is a class; in F# we can use a type. Below are the required types: `MammogramData` is the input data and `MammogramPrediction` is the output prediction. `MammogramData` is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data: each attribute must be `mutable public`, and it requires a `[<Column("#")>]` attribute to specify its column position along with a `[<DefaultValue>]` attribute. `MammogramPrediction` needs `PredictedLabel` for the cluster id and `Score` for the calculated distances to all clusters.
```fsharp
type MammogramPrediction() =
```
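Filled out, the two types look something like this sketch. The column indexes follow the data file layout described above; the field types and attribute usage are my reading of what v0.7.0 expects:

```fsharp
// Input row: one record per examination. Column indexes match the data file.
type MammogramData() =
    [<Column("0")>] [<DefaultValue>] val mutable public BiRads   : float32
    [<Column("1")>] [<DefaultValue>] val mutable public Age      : float32
    [<Column("2")>] [<DefaultValue>] val mutable public Shape    : float32
    [<Column("3")>] [<DefaultValue>] val mutable public Margin   : float32
    [<Column("4")>] [<DefaultValue>] val mutable public Density  : float32
    [<Column("5")>] [<DefaultValue>] val mutable public Severity : float32

// Output row: PredictedLabel is the cluster id, Score holds the distance
// to each cluster centroid.
type MammogramPrediction() =
    [<DefaultValue>] val mutable public PredictedLabel : uint32
    [<DefaultValue>] val mutable public Score : float32[]
```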
Here is one of the big changes from early versions: the pipeline object is gone, replaced with an `MLContext`. Although different, it is still intuitive and gains additional functionality. First, create an `MLContext`; if desired, a `seed` can be defined to ensure the same results between executions.
```fsharp
let mlContext = MLContext()
```
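If you want reproducible runs, construct the context with a seed instead; wrapping it in `Nullable` is how I would expect the optional parameter to surface from F#:

```fsharp
// A fixed seed makes the cluster assignments repeatable between runs
let mlContext = MLContext(seed = Nullable 1)
```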
Time to load the data. This is another area that has changed since earlier versions. First create a `TextReader` with a file format definition, then use that object to read the data from the data file. The entire file could be used for training; alternatively, `TrainTestSplit` (another new function) can be used to easily divide a single dataset into train and test sets. This is especially handy during the development process.
```fsharp
let dataPath = "./data/mammographic_masses.data"
```
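A rough sketch of the reader definition and the split is below. The `TextReader` argument names and the `TrainTestSplit` signature are my best recollection of the v0.7.0 surface area, so treat them as assumptions:

```fsharp
// Describe the file: comma separated, no header, six single-precision columns
let args = TextLoader.Arguments()
args.Separator <- ","
args.HasHeader <- false
args.Column <-
    [| TextLoader.Column("BiRads",   Nullable DataKind.R4, 0)
       TextLoader.Column("Age",      Nullable DataKind.R4, 1)
       TextLoader.Column("Shape",    Nullable DataKind.R4, 2)
       TextLoader.Column("Margin",   Nullable DataKind.R4, 3)
       TextLoader.Column("Density",  Nullable DataKind.R4, 4)
       TextLoader.Column("Severity", Nullable DataKind.R4, 5) |]

let textLoader = mlContext.Data.TextReader(args)   // assumed v0.7.0 factory method
let allData = textLoader.Read(dataPath)

// Hold back 20% of the rows for validation (returns a struct tuple)
let struct (trainData, testData) =
    mlContext.Clustering.TrainTestSplit(allData, testFraction = 0.2)
```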
After the data is loaded, feature columns need to be added to the transforms. I'm going to use all feature columns from the file and exclude severity. The clustering model requires features to be numeric, which is fine here. As the other posts show, you can convert text to numeric mappings if necessary.
```fsharp
let dataProcessPipeline = mlContext.Transforms.Concatenate("Features", "BiRads", "Age", "Shape", "Margin", "Density")
```
Now that the features are defined, it is time to create a model. This will be `KMeans`. Similar to the other trainers, custom parameters can be defined; I have decided to use `K = 4`. It also has other options such as `MaxIterations`, `OptTol` (convergence tolerance), and `NormalizeFeatures`. The KMeans trainer/estimator must be combined with the training data to create a model. The last part is to create a prediction function from the model. Note the `MammogramData` and `MammogramPrediction` types as part of the call.
```fsharp
let trainer = mlContext.Clustering.Trainers.KMeans(features = "Features", clustersCount = 4)
```
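The fit and prediction-function steps might look roughly like this; `MakePredictionFunction` is the name I recall the v0.7.0 API using, so take it as an assumption:

```fsharp
// Chain the feature transform and the trainer, then fit against the training rows
let trainingPipeline = dataProcessPipeline.Append(trainer)
let model = trainingPipeline.Fit(trainData)

// Wrap the fitted model in a strongly typed single-row prediction function
let predictionFunction =
    model.MakePredictionFunction<MammogramData, MammogramPrediction>(mlContext)
```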
Validation of any model is important. With the data split into train and test sets, it is easy to get metrics against the training data and then validate against the previously unseen test data.
```fsharp
// Evaluate results (train)
```
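A sketch of that evaluation step; the metric property name (`AvgMinScore`, the average distance to the nearest centroid) is my recollection of the v0.7.0 clustering metrics:

```fsharp
// Score each split with the fitted model and ask the clustering context for metrics
let trainMetrics = mlContext.Clustering.Evaluate(model.Transform(trainData))
let testMetrics  = mlContext.Clustering.Evaluate(model.Transform(testData))

printfn "Train Data:"
printfn "  Average min score (distance to centroid): %f" trainMetrics.AvgMinScore
printfn "Test Data:"
printfn "  Average min score (distance to centroid): %f" testMetrics.AvgMinScore
```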
```
Train Data:
```
With the initial evaluation out of the way, it is time to move on to individual predictions. I want to create aggregate classification percentages for each cluster. To do this I take the predictive model and apply it against the training file. Using the predicted cluster and the training label, I create a mapping for detailed predictions. Each cluster gets its own raw benign/malignant count, which can be converted into a percentage likelihood for each classification. I have the details annotated in comments to make it easier to follow. Honestly, this is the most labor-intensive part of the process. I'd love to be able to pass a cluster-aggregate-score function in as part of the trainer to eliminate this work of reprocessing the data. Once I have these results as a `Map`, I can query results easily enough.
```fsharp
// Create classifications by cluster
```
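A sketch of how that aggregation could look. The file parsing here (splitting on commas and skipping rows with missing "?" values) is my own simplification rather than the original code, but the shape of the result is the same: a `Map` from cluster id to benign/malignant percentages:

```fsharp
// Run every training row back through the prediction function, tally
// benign/malignant counts per cluster, then convert the counts to percentages.
let clusterIdToPrediction =
    File.ReadAllLines(dataPath)
    |> Array.filter (fun line -> not (line.Contains("?")))   // drop rows with missing values
    |> Array.map (fun line ->
        let cols = line.Split(',') |> Array.map float32
        let row = MammogramData()
        row.BiRads   <- cols.[0]
        row.Age      <- cols.[1]
        row.Shape    <- cols.[2]
        row.Margin   <- cols.[3]
        row.Density  <- cols.[4]
        row.Severity <- cols.[5]
        predictionFunction.Predict(row).PredictedLabel, cols.[5])
    |> Array.groupBy fst
    |> Array.map (fun (clusterId, rows) ->
        let total     = float rows.Length
        let malignant = rows |> Array.filter (fun (_, severity) -> severity = 1.0f)
                             |> Array.length |> float
        // (benign %, malignant %) for this cluster
        clusterId, (1.0 - malignant / total, malignant / total))
    |> Map.ofArray
```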
Now that `clusterIdToPrediction` is defined, I can pair the ML.NET cluster prediction with the aggregated cluster classification percentages. First, create a `MammogramData` object and provide it to the `Predict` method. Second, use the predicted cluster id with the aggregated cluster classification percentages to get a classification result. For this example, I pull one of those rows from the training data.
```fsharp
// Prediction
```
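A sketch of that lookup; the attribute values below are placeholders standing in for the actual training row used in the post:

```fsharp
// Build a single examination record (placeholder values), predict its cluster,
// then look up the aggregated percentages for that cluster.
let sample = MammogramData()
sample.BiRads  <- 5.0f
sample.Age     <- 67.0f
sample.Shape   <- 3.0f
sample.Margin  <- 5.0f
sample.Density <- 3.0f

let prediction = predictionFunction.Predict(sample)
let benignPct, malignantPct = clusterIdToPrediction.[prediction.PredictedLabel]

printfn "# Prediction Result:"
printfn "Cluster: %d, Benign: %.0f%%, Malignant: %.0f%%"
    prediction.PredictedLabel (benignPct * 100.0) (malignantPct * 100.0)
```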
The results show the prediction falls into cluster 1, which has an 80% likelihood of being malignant; this matches the actual value.
```
# Prediction Result:
```
Once a model has been created, it is often useful to save it for later use. The save method has changed from previous versions. Once saved, the model can be loaded for future use.
```fsharp
// Save model to file
```
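A rough sketch of the save-and-reload round trip. The `mlContext.Model.Save`/`Load` members shown here are an assumption on my part for v0.7.0, so verify them against the installed package:

```fsharp
let modelPath = "./MammogramClusterModel.zip"   // hypothetical output path

// Persist the fitted model (save API shape assumed for v0.7.0)
do
    use stream = File.Create(modelPath)
    mlContext.Model.Save(model, stream)

// Reload it and rebuild the prediction function from the loaded transformer
let reloadedModel =
    use stream = File.OpenRead(modelPath)
    mlContext.Model.Load(stream)

let reloadedPredictionFunction =
    reloadedModel.MakePredictionFunction<MammogramData, MammogramPrediction>(mlContext)
```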
As expected, the prediction results are the same with the reloaded model.
```
# Prediction Result: (model reloaded):
```
Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with `dotnet run`.
```
Train Data:
```
This has been a brief look into training and using an ML.NET k-means cluster model. As seen with the other models, ML.NET is providing a nice consistent interface and has some good components. It is a framework that continues to grow in a positive direction. Kudos and thanks to all the people making this a reality. That’s all for now. Until next time.