Today’s topic is tackling a Kaggle problem with XGBoost and F#. Comparing Quora question intent offers a perfect opportunity to work with XGBoost, a common tool in Kaggle competitions. Luckily, there is a .NET wrapper around the XGBoost library: XGBoost.Net.
Before going too far, let’s break down the data formats. First, Kaggle provides a train.csv, which is used for training models. It contains question pairs and the ground truth regarding whether they are duplicates. Second, test.csv contains question pairs with no ground truth. It is used for generating the submission file for Kaggle. Third, submission.csv holds the results to submit to Kaggle for judging. Its is_duplicate column represents the predicted likelihood (a probability between 0 and 1) of the pair being a duplicate. Below are example rows from the train and test datasets.
// train.csv
"id","qid1","qid2","question1","question2","is_duplicate"
"0","1","2","What is the step by step guide to invest in share market in india?","What is the step by step guide to invest in share market?","0"
"1","3","4","What is the story of Kohinoor (Koh-i-Noor) Diamond?","What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?","0"

// test.csv
"test_id","question1","question2"
0,"How does the Surface Pro himself 4 compare with iPad Pro?","Why did Microsoft choose core m3 and not core i3 home Surface Pro 4?"
1,"Should I have a hair transplant at age 24? How much would it cost?","How much cost does hair transplant require?"
Now that the data is out of the way, time to get started. Using Paket, here is a sample paket.dependencies file.
source https://nuget.org/api/v2
nuget FSharp.Data
nuget PicNet.XGBoost
Here is the boilerplate and initial variables. Most of this is self-explanatory, although I want to call out a couple of things specifically. As expected, type providers will be used to load the csv datasets. When I get to the model training section, there will be hyperparameters; these are represented by ModelParameterType and ModelParameter. Feature extraction will use dataset-level metadata. Since this is meant to be a simple example, the only metadata will be the average number of words in a question. As shown above, the train and test files have slightly different formats. Whatever method I use, I want to be able to run the same code against train and test. StandardRow enables this by standardizing the input row format for transformation.
open System
open System.IO
open FSharp.Data
open XGBoost

/// Percent of training dataset to use for training
/// Note: ValidationPct = 1. - TrainPct
[<Literal>]
let TrainPct = 0.8

/// Training filename
[<Literal>]
let TrainFilename = "../data/train.csv"

/// Kaggle test filename (used to generate submission)
[<Literal>]
let TestFilename = "../data/test.csv"

/// Kaggle submission filename
[<Literal>]
let SubmissionFilename = "../data/submission.csv"

/// Type of hyperparameter value
type ModelParameterType =
    | Int
    | Float32

/// Model hyperparameter
type ModelParameter = { Name: string; Type: ModelParameterType; Value: float }

/// Dataset Metadata (Used for feature calculation)
type Metadata = { AverageWordCount: float32 }

/// Standardized row
type StandardRow = { QuestionId: int; Label: float32; Features: float32[] }

/// Training dataset
type TrainData = CsvProvider<TrainFilename>

/// Test/Submission dataset
type TestData = CsvProvider<TestFilename>
To ensure proper model training, the provided train.csv will be broken into a train and a validation set. This method could be more advanced, but taking the first x% for training and the remaining (100-x)% for validation works well enough in this case. Since the train and test files are different, a conversion function is also needed.
/// Sample dataset into train and validation datasets
let sample (input:CsvProvider<TrainFilename>) trainPct =
    let trainRows = int (float (input.Rows |> Seq.length) * trainPct)
    let trainData = input.Rows |> Seq.take trainRows |> Seq.toArray
    let validationData = input.Rows |> Seq.skip trainRows |> Seq.toArray
    (trainData, validationData)

/// Convert the test data format to train data format
/// Note: This is necessary because their train and test datasets differ slightly
let convertTestToTrainFormat (input:CsvProvider<TestFilename>.Row []) :(CsvProvider<TrainFilename>.Row []) =
    input
    |> Array.map (fun x -> new CsvProvider<TrainFilename>.Row(x.Test_id, 0, 0, x.Question1, x.Question2, false))
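The split above takes rows in file order. If the order of train.csv turned out to matter, one possible refinement is to shuffle before splitting. The following is only a sketch: `shuffleSplit` and the fixed seed are hypothetical and not part of the original code, and it works on any array so it can be tried without the type provider.

```fsharp
open System

/// Hypothetical helper: shuffle rows with a seeded Fisher-Yates pass, then split.
/// The seed makes the split repeatable between runs.
let shuffleSplit (seed:int) (trainPct:float) (rows:'a[]) =
    let rng = Random(seed)
    let shuffled = Array.copy rows
    // Fisher-Yates: walk backwards, swapping each slot with a random slot at or before it
    for i in shuffled.Length - 1 .. -1 .. 1 do
        let j = rng.Next(i + 1)
        let tmp = shuffled.[i]
        shuffled.[i] <- shuffled.[j]
        shuffled.[j] <- tmp
    let trainRows = int (float shuffled.Length * trainPct)
    Array.splitAt trainRows shuffled

// Plain integers standing in for csv rows
let (trainPart, validationPart) = shuffleSplit 42 0.8 [| 1 .. 10 |]
printfn "train=%d validation=%d" trainPart.Length validationPart.Length
```

With real data, something like `shuffleSplit 42 TrainPct (allData.Rows |> Seq.toArray)` could stand in for `sample`, assuming the rows fit in memory.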
Here are the feature-generating functions and their supporting helpers. For pedagogical reasons the feature set is going to be overly simplistic. This won’t produce great predictions, but proper feature creation can get involved. More advanced feature extraction will be addressed in a later post. For now, this will be enough to get some results without losing the primary goal in a forest of feature extraction code.
Some features may need aggregate information about the dataset. This is commonly used for scaling or for comparison against averages. It will be stored in a dataset metadata object that all rows have access to during row transformation and feature extraction. The row-specific features are the length and word count of the two questions being compared. In addition, the difference in word count between the questions is considered.
/// Number of words in sentence
let wordCount (s:string) = Array.length (s.Split([| ' ' |]))

/// Absolute value
let abs (x:int) = Math.Abs(x)

/// Calculate dataset metadata for feature calculation
let metadata (input:CsvProvider<TrainFilename>.Row []) =
    let averageWordCount =
        input
        |> Array.collect (fun row ->
            [| Array.length (row.Question1.Split([| ' ' |]));
               Array.length (row.Question2.Split([| ' ' |])) |])
        |> Array.sum
        |> (fun total -> float32 total / float32 (input.Length * 2))

    { Metadata.AverageWordCount = averageWordCount }

/// Calculate features for a row
let rowFeatures (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row) =
    [|
        float32 input.Question1.Length;
        float32 input.Question2.Length;
        (wordCount >> float32) input.Question1;
        (wordCount >> float32) input.Question2;
        (abs >> float32) (wordCount input.Question1 - wordCount input.Question2);
    |]

/// Transform csv row into label + features
let transform (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row []) =
    input
    |> Array.map (fun row ->
        { StandardRow.QuestionId = row.Id;
          Label = if row.Is_duplicate then float32 1. else float32 0.;
          Features = rowFeatures metadata row })
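As written, rowFeatures receives the metadata but does not use it yet. To illustrate the scaling idea mentioned above, here is a minimal sketch of a feature that could draw on AverageWordCount. The `relativeWordCount` name is hypothetical, and the Metadata record and wordCount are redeclared only so the snippet stands alone.

```fsharp
/// Redeclared here so the snippet is self-contained
type Metadata = { AverageWordCount: float32 }

/// Number of words in sentence (same definition as in the post)
let wordCount (s:string) = Array.length (s.Split([| ' ' |]))

/// Hypothetical feature: word count relative to the dataset average
/// (1.0 = average length, above 1.0 = longer than average)
let relativeWordCount (metadata:Metadata) (question:string) =
    float32 (wordCount question) / metadata.AverageWordCount

// Made-up average, used only for illustration
let exampleMetadata = { AverageWordCount = 10.f }
let ratio = relativeWordCount exampleMetadata "How much does a hair transplant cost?"
printfn "%f" ratio
```

A feature like this could be appended to the array in rowFeatures, once for each question.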
Now it is time to look at the XGBoost functionality. Generating a model is as simple as creating a classifier, applying a hyperparameter set, and then running .Fit using the training data (features and labels). One small note: the library uses float32[] for most of its numeric interactions.
Once the model is trained, it can be applied using PredictProba against an array of features (that match the structure of the training data). The result is an array of probabilities per class. Since this is a binary classification, [0.34, 0.66] means there is a 34% chance the result is false, and 66% chance the result is true. For the final submission, a percentage is desired, but for training, it is useful to know the binary true/false regarding duplicate question status.
// Given training data and hyperparameters, create an xgboost classification model
let buildXgClassModel (trainInput:float32[][]) (trainOutput:float32[]) (parameters:ModelParameter list) =
    let model = XGBClassifier()

    // To handle xgboost types, I carry along the type with parameter values,
    // and cast accordingly when I set the values
    parameters
    |> List.iter (fun parameter ->
        match parameter.Type with
        | Int -> model.SetParameter(parameter.Name, (int parameter.Value))
        | Float32 -> model.SetParameter(parameter.Name, (float32 parameter.Value)))

    model.Fit(trainInput, trainOutput)
    model

let predictionProbabilities (model:XGBClassifier) (inputs:float32[][]) =
    // Note, provides prob for each class (ex: 0=0.67, 1=0.33)
    model.PredictProba(inputs)

let predictionValues (model:XGBClassifier) (inputs:float32[][]) =
    // Note, provides prob for each class (ex: 0=0.67, 1=0.33)
    // Higher probability is the class that "wins"
    predictionProbabilities model inputs
    |> Array.map (fun x -> if x.[0] > x.[1] then 0 else 1)
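To make the class selection concrete without a trained model, here is a small sketch that applies the same comparison to made-up probability pairs. The `pickClass` name and the sample values are illustrative only; the real pairs would come from PredictProba.

```fsharp
/// Pick the winning class from a [| P(class 0); P(class 1) |] pair,
/// mirroring the comparison used in predictionValues
let pickClass (probabilities:float32[]) =
    if probabilities.[0] > probabilities.[1] then 0 else 1

// Made-up probabilities standing in for PredictProba output
let sampleProbabilities = [| [| 0.34f; 0.66f |]; [| 0.80f; 0.20f |] |]
let classes = sampleProbabilities |> Array.map pickClass
printfn "%A" classes
```

The first pair resolves to class 1 (duplicate) and the second to class 0. Note that an exact tie also resolves to class 1, because the comparison is a strict greater-than.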
To facilitate debugging and improvement, a confusion matrix is very useful. This, along with overall accuracy reporting, will assist in future development iterations.
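/// Use a model to create predictions from input values,
/// then compare target output to predicted output
let evaluatePredictionResults model input targetOutput =
    let predictedValidationValues = predictionValues model input
    // Note: comparePredictions is not defined in this listing; it is assumed to return
    // one bool per row (true when the target and the prediction agree)
    let predictedValidationMatches = comparePredictions targetOutput predictedValidationValues
    let pctValidationMatches = float (predictedValidationMatches |> Array.filter id |> Array.length) / float (predictedValidationMatches |> Array.length)
    // Report accuracy (the confusion matrix printing is not part of this listing)
    printfn "Accuracy: %f" pctValidationMatches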
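The listing above calls comparePredictions and mentions a confusion matrix, but neither definition is included. The following is a guess at their shape, not the original implementation: it assumes float32 labels (0.f or 1.f) and int predictions (0 or 1), as used elsewhere in the post, and `confusionMatrix` is a hypothetical helper name. In a real script, comparePredictions would need to be defined above evaluatePredictionResults.

```fsharp
/// Assumed shape of comparePredictions: true where target and prediction agree
let comparePredictions (targets:float32[]) (predicted:int[]) =
    Array.zip targets predicted
    |> Array.map (fun (target, prediction) -> int target = prediction)

/// Hypothetical helper: counts as (TT, TF, FT, FF), target first, prediction second
let confusionMatrix (targets:float32[]) (predicted:int[]) =
    let count t p =
        Array.zip targets predicted
        |> Array.filter (fun (target, prediction) -> int target = t && prediction = p)
        |> Array.length
    (count 1 1, count 1 0, count 0 1, count 0 0)

// Made-up labels and predictions
let sampleTargets = [| 1.f; 1.f; 0.f; 0.f; 0.f |]
let samplePredicted = [| 1; 0; 0; 0; 1 |]
let matches = comparePredictions sampleTargets samplePredicted
let (tt, tf, ft, ff) = confusionMatrix sampleTargets samplePredicted
printfn "T\\P\tT\tF"
printfn "T\t%d\t%d" tt tf
printfn "F\t%d\t%d" ft ff
```

Rows are the target class and columns are the predicted class, so accuracy is (TT + FF) divided by the total row count.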
Since the submission file has specific criteria, there are some functions to create the submission file. This is primarily formatting the percents as Kaggle expects and then writing the dataset to a file.
/// Convert probabilities per classification to a single probability
/// Note: if class 0 "wins", invert its percent, since the final result expects low percents to map to class 0.
let convertPredictionToProbability (probabilities: float32[]) =
    if probabilities.[0] > probabilities.[1]
    then 1.f - probabilities.[0]
    else probabilities.[1]

/// Combine question ids with prediction results
let formatSubmissionData (rows:StandardRow[]) (predictions:float32[][]) =
    (rows, predictions)
    ||> Array.zip
    |> Array.map (fun (input, prediction) ->
        let questionId = input.QuestionId
        let probability = convertPredictionToProbability prediction
        (questionId, probability))

// Write submission data to file
let writeSubmissionFile (submissionFilename:string) (submissionData: (int * float32)[]) =
    // 'use' disposes (flushes and closes) the writer when the function exits
    use fileStream = new StreamWriter(submissionFilename)
    fileStream.WriteLine("test_id,is_duplicate")
    submissionData
    |> Array.iter (fun (id, probability) ->
        let line = sprintf "%d,%f" id probability
        fileStream.WriteLine(line))
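As a quick check of the probability conversion, here is a small worked example with made-up probability pairs. The function body is copied from the post so the snippet stands alone; the input values are illustrative only.

```fsharp
/// Same logic as convertPredictionToProbability in the post
let convertPredictionToProbability (probabilities:float32[]) =
    if probabilities.[0] > probabilities.[1]
    then 1.f - probabilities.[0]
    else probabilities.[1]

// Made-up pairs: one where class 1 wins, one where class 0 wins
let duplicateLikely = convertPredictionToProbability [| 0.34f; 0.66f |]
let duplicateUnlikely = convertPredictionToProbability [| 0.75f; 0.25f |]
printfn "%f %f" duplicateLikely duplicateUnlikely

// One submission line, formatted the same way writeSubmissionFile does
let line = sprintf "%d,%f" 0 duplicateUnlikely
printfn "%s" line
```

When the two class probabilities sum to 1, both branches produce the class 1 probability, so the submitted value is the model's estimate that the pair is a duplicate.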
Now that all the hard work is done, it is time to put it all together. The first step is data preparation. First, load the training data and split into train and validation sets. Second, build dataset level metadata. Third, run transformations (feature creation) against the datasets. Fourth, structure the data for model training by generating the appropriate label and features arrays.
/// Training data
let allData = TrainData.Load(TrainFilename)
let (trainData, validationData) = sample allData TrainPct

let trainMetadata = metadata trainData
let transformedTrainData = transform trainMetadata trainData
let transformedValidationData = transform trainMetadata validationData

let trainInput = transformedTrainData |> Array.map (fun row -> row.Features)
let trainOutput = transformedTrainData |> Array.map (fun row -> row.Label)

let validationInput = transformedValidationData |> Array.map (fun row -> row.Features)
let validationOutput = transformedValidationData |> Array.map (fun row -> row.Label)
Time to train the model. XGBoost supports the parameters below, populated here with some reasonable values for the dataset in question. Hyperparameter optimization is out of scope for this post, but it should be leveraged here to find the best training model. In a later post I’ll discuss a simple method to approach this topic.
Once trained, report on prediction capability against the original training set as well as the validation set (which the model hasn’t seen).
/// Model training parameters
let modelParameters = [
    { Name = "max_depth"; Type = ModelParameterType.Int; Value = 10. };
    { Name = "learning_rate"; Type = ModelParameterType.Float32; Value = 0.76 };
    { Name = "gamma"; Type = ModelParameterType.Float32; Value = 1.9 };
    { Name = "min_child_weight"; Type = ModelParameterType.Int; Value = 5. };
    { Name = "max_delta_step"; Type = ModelParameterType.Int; Value = 0. };
    { Name = "subsample"; Type = ModelParameterType.Float32; Value = 0.75 };
    { Name = "colsample"; Type = ModelParameterType.Float32; Value = 0.75 };
    { Name = "reg_lambda"; Type = ModelParameterType.Float32; Value = 4. };
    { Name = "reg_alpha"; Type = ModelParameterType.Float32; Value = 1. } ]

/// Trained model
let finalModel = buildXgClassModel trainInput trainOutput modelParameters
Here are the prediction results for the train and validation sets. The prediction capability isn’t great, but the validation set holds up comparatively well. At least overfitting isn’t a concern (for now). This also shows that there is plenty of room for improvement through more and better features.
> evaluatePredictionResults finalModel trainInput trainOutput
Accuracy: 0.680396
T\P      T       F
T    53352   36546
F    66824  166710

> evaluatePredictionResults finalModel validationInput validationOutput
Accuracy: 0.651030
T\P      T       F
T    11625   10755
F    17462   41016
Now it is time to create the final predictions and submission file for Kaggle. To do this, replicate the validation workflow, with a couple of caveats. First, the test dataset is formatted slightly differently. Since this is data with no known classifications, there is no class column in the file. So I need to load the test data, then run the conversion so the test data matches the format of the training data. Second, the submission file needs to be populated with the likelihood of the questions being duplicates (not with a straight classification). Lastly, write the id along with the result to the submission file.
let testData = TestData.Load(TestFilename).Rows |> Seq.toArray
let transformedTestData = transform trainMetadata (convertTestToTrainFormat testData)
let testInput = transformedTestData |> Array.map (fun row -> row.Features)
let testPredictions = predictionProbabilities finalModel testInput
let submissionData = formatSubmissionData transformedTestData testPredictions
writeSubmissionFile SubmissionFilename submissionData
All that is left to do is submit the file for judging. Spoiler alert: because this is an overly simplified model, it fared poorly. As I mentioned in the beginning, the current feature set isn’t good. In addition, the hyperparameters could benefit from some search of their own. These are both topics I plan on discussing in future posts. F# and .NET still have a couple more tricks up their sleeves to get these results even better. Hopefully this has provided a bit of inspiration to try F# in your own projects, perhaps even a Kaggle competition. Until next time.