It is time for an exploration into using a Decision Tree to classify Lego set themes using F# and Accord.NET.
First things first, the data. Rebrickable has downloadable datasets for Lego sets and pieces. I’ll use the sets.csv file as the primary dataset, but will grab information from set_pieces.csv, pieces.csv, and colors.csv for feature creation. The files are not in a format appropriate for a decision tree, so some transformations will need to happen first. I don’t want the post to get too long, so this project will be broken into two parts. Part 1 will build the feature file and get the data into the desired consumable format; Part 2 will actually use the file to get to the end goal.
Second, the approach. The goal of Part 1 is to do all transformations here. I want the end result to be a file that can be directly loaded into Part 2’s code. I will use or create a series of features. Year is the set’s year, and is directly provided. The remaining features need to be grouped and calculated. First is “% of the set’s pieces that are of a given type” for a few major piece types. Second is “% of the set’s pieces that are of a given color” for major color groups. Lastly, the prediction target is theme. The dataset has up to three themes per set (T1, T2, T3). For simplicity’s sake I am only going to use one theme (T1) as the target theme to predict. This will restrict the quality of my results, but as a proof of concept it will be good enough. Hopefully all this gives me some interesting results.
Using Paket, here is a sample paket.dependencies file.
Next it is time to leverage the CsvProvider for the input files. The code below configures the types and loads the data. You’ve probably read it a million times, but Type Providers are really helpful for getting to work with the data quickly.
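open FSharp.Data
// Configure the CSV type providers from the input files
type LegoSets = CsvProvider<"../data/legos/sets.csv">
type LegoSetPieces = CsvProvider<"../data/legos/set_pieces.csv">
type LegoPieces = CsvProvider<"../data/legos/pieces.csv">
type LegoColors = CsvProvider<"../data/legos/colors.csv">
// Load files
let legoSets = LegoSets.Load "../data/legos/sets.csv"
let legoSetPieces = LegoSetPieces.Load "../data/legos/set_pieces.csv"
let legoPieces = LegoPieces.Load "../data/legos/pieces.csv"
let legoColors = LegoColors.Load "../data/legos/colors.csv"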
When building the features, I will be counting the number of specific colors in the set. There are 135 different colors. I only care about eight different colors: Red, Green, Blue, White, Black, Gray, Silver, and Translucent. I will ignore the rest. As an expedient hack, I search for the color text in the description. So ‘Red’, ‘Trans-Red’, and ‘Dark Red’ all count as ‘Red’. I then store these indexes for later searching. This method misses things like ‘Pink’, which is in the red family. It also means ‘Trans-Red’ counts as a red piece and a translucent piece. For a real problem I would be more thorough, but I just want to get to the decision tree.
// Collect the ids of every color whose description contains the given text
let getColorIndexes color =
    legoColors.Rows
    |> Seq.filter (fun x -> x.Descr.Contains(color))
    |> Seq.map (fun x -> x.Id)
    |> Seq.toList

let redIndexes = getColorIndexes "Red"
let greenIndexes = getColorIndexes "Green"
let blueIndexes = getColorIndexes "Blue"
let whiteIndexes = getColorIndexes "White"
let blackIndexes = getColorIndexes "Black"
let grayIndexes = getColorIndexes "Gray"
let silverIndexes = getColorIndexes "Silver"
let translucentIndexes = getColorIndexes "Trans"
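To make the overlap described above concrete, here is a small sketch that runs the same substring match over an in-memory list instead of colors.csv. The ids and the `sampleColors` and `sampleIndexes` names are made up for illustration; only the matching idea comes from the code above.

```fsharp
// Made-up ids and names, for illustration only (not the real colors.csv rows).
let sampleColors =
    [ 1, "Red"
      2, "Trans-Red"
      3, "Dark Red"
      4, "Pink"
      5, "Blue" ]

// Same idea as getColorIndexes, but over the in-memory sample.
let sampleIndexes (color: string) =
    sampleColors
    |> List.filter (fun (_, descr) -> descr.Contains(color))
    |> List.map fst

let sampleRed = sampleIndexes "Red"      // ids 1, 2 and 3; "Pink" is missed
let sampleTrans = sampleIndexes "Trans"  // id 2, which is also in sampleRed
```

Id 2 lands in both lists, which is the double counting mentioned above, and "Pink" never shows up under "Red".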
To perform piece counts in the sets I’ll need to do some grouping. I will use SetCountsDetail as an intermediate aggregation record type. SetDetail will be my final output form. You may notice I use counts for the aggregation, but in the final output I store “Percent of the set”. I feel this should allow the feature values to be consistent across sets. I also use the function setCountsDetailSum when folding group sums together.
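The record definitions are not shown here, so the following is only a sketch of what the intermediate type and the sum function might look like. The field names are taken from the code later in the post; the `string` and `int` field types, the body of `setCountsDetailSum`, and the `zeroCounts` helper are assumptions.

```fsharp
// Sketch only: field names come from the post, the field types are assumed.
type SetCountsDetail =
    { SetId: string
      BricksCount: int
      PlatesCount: int
      MinifigsCount: int
      PanelsCount: int
      PlantsAndAnimalsCount: int
      TilesCount: int
      TechnicsCount: int
      RedCount: int
      GreenCount: int
      BlueCount: int
      WhiteCount: int
      BlackCount: int
      GrayCount: int
      SilverCount: int
      TranslucentCount: int }

// Assumed implementation: add two records column by column, keeping a's SetId.
let setCountsDetailSum (a: SetCountsDetail) (b: SetCountsDetail) =
    { SetId = a.SetId
      BricksCount = a.BricksCount + b.BricksCount
      PlatesCount = a.PlatesCount + b.PlatesCount
      MinifigsCount = a.MinifigsCount + b.MinifigsCount
      PanelsCount = a.PanelsCount + b.PanelsCount
      PlantsAndAnimalsCount = a.PlantsAndAnimalsCount + b.PlantsAndAnimalsCount
      TilesCount = a.TilesCount + b.TilesCount
      TechnicsCount = a.TechnicsCount + b.TechnicsCount
      RedCount = a.RedCount + b.RedCount
      GreenCount = a.GreenCount + b.GreenCount
      BlueCount = a.BlueCount + b.BlueCount
      WhiteCount = a.WhiteCount + b.WhiteCount
      BlackCount = a.BlackCount + b.BlackCount
      GrayCount = a.GrayCount + b.GrayCount
      SilverCount = a.SilverCount + b.SilverCount
      TranslucentCount = a.TranslucentCount + b.TranslucentCount }

// Hypothetical helper: an all-zero record for a given set id.
let zeroCounts (setId: string) =
    { SetId = setId
      BricksCount = 0; PlatesCount = 0; MinifigsCount = 0; PanelsCount = 0
      PlantsAndAnimalsCount = 0; TilesCount = 0; TechnicsCount = 0
      RedCount = 0; GreenCount = 0; BlueCount = 0; WhiteCount = 0
      BlackCount = 0; GrayCount = 0; SilverCount = 0; TranslucentCount = 0 }

// Usage: two rows of the same (made-up) set folded into one.
let rowA = { zeroCounts "demo-1" with BricksCount = 10; RedCount = 4 }
let rowB = { zeroCounts "demo-1" with BricksCount = 5; PlatesCount = 2 }
let demoTotal = setCountsDetailSum rowA rowB
```

Keeping raw counts in the intermediate record means rows can be added together freely; the conversion to percentages only has to happen once, on the final per-set totals.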
Now I create a piece lookup using a Map (for the non-F#ers, think Dictionary). I also filter only the piece type categories I care about. There are 56, and for simplicity I will only look at seven. I also group all “Technic*” categories into a single “Technic” category.
// Build lookup for pieces
let pieceLookup =
    legoPieces.Rows
    // Only include these categories
    |> Seq.filter (fun row ->
        row.Category = "Bricks" ||
        row.Category = "Plates" ||
        row.Category = "Minifigs" ||
        row.Category = "Panels" ||
        row.Category = "Plants and Animals" ||
        row.Category = "Tiles" ||
        row.Category.Contains("Technic"))
    // Collapse every "Technic*" category into a single "Technic" category
    |> Seq.map (fun row ->
        (row.Piece_id,
         if row.Category.Contains("Technic") then "Technic" else row.Category))
    |> Map
Here I take a row and transpose it into the intermediate feature columns I want. isColorIndex is a helper function to determine if the specified piece color is part of one of my color groupings.
// Check whether a color id belongs to one of the color index lists
let isColorIndex indexList colorIndex =
    indexList |> List.contains colorIndex
// Convert data row to a SetCountsDetail record
let rowToSetCountsDetail (row:CsvProvider<"../data/legos/set_pieces.csv">.Row) =
    // Piece category for this row (the piece must be present in pieceLookup)
    let c = pieceLookup.Item row.Piece_id
    { SetCountsDetail.SetId = row.Set_id;
      BricksCount = if c = "Bricks" then row.Num else 0;
      PlatesCount = if c = "Plates" then row.Num else 0;
      MinifigsCount = if c = "Minifigs" then row.Num else 0;
      PanelsCount = if c = "Panels" then row.Num else 0;
      PlantsAndAnimalsCount = if c = "Plants and Animals" then row.Num else 0;
      TilesCount = if c = "Tiles" then row.Num else 0;
      TechnicsCount = if c = "Technic" then row.Num else 0;
      RedCount = if isColorIndex redIndexes row.Color then row.Num else 0;
      GreenCount = if isColorIndex greenIndexes row.Color then row.Num else 0;
      BlueCount = if isColorIndex blueIndexes row.Color then row.Num else 0;
      WhiteCount = if isColorIndex whiteIndexes row.Color then row.Num else 0;
      BlackCount = if isColorIndex blackIndexes row.Color then row.Num else 0;
      GrayCount = if isColorIndex grayIndexes row.Color then row.Num else 0;
      SilverCount = if isColorIndex silverIndexes row.Color then row.Num else 0;
      TranslucentCount = if isColorIndex translucentIndexes row.Color then row.Num else 0 }
I then create a series of lookup functions to support the transformation process. In setPiecesLookup I make a Map for piece counts by SetId. It gets a little gnarly, but it does a group by on SetId, then sums all columns up to that level. getCountLookup is used to get piece counts by SetId. themesLookup maps the set’s theme text to an arbitrary int. I will save that into a lookup table/file as well for later access.
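The definition of setPiecesLookup is not shown, so here is a minimal sketch of the group-and-sum step it describes. To keep it short it uses a trimmed-down stand-in record with only two count columns; `CountsSketch`, `sumCounts`, `buildLookup`, and the sample rows are all hypothetical, and the real code may fold rather than reduce.

```fsharp
// Trimmed-down stand-in for SetCountsDetail (only two count columns).
type CountsSketch =
    { SetId: string
      BricksCount: int
      RedCount: int }

// Column-wise sum of two records belonging to the same set.
let sumCounts (a: CountsSketch) (b: CountsSketch) =
    { SetId = a.SetId
      BricksCount = a.BricksCount + b.BricksCount
      RedCount = a.RedCount + b.RedCount }

// Group rows by SetId, collapse each group to one summed record,
// and index the result by SetId.
let buildLookup (rows: seq<CountsSketch>) =
    rows
    |> Seq.groupBy (fun r -> r.SetId)
    |> Seq.map (fun (setId, group) -> setId, Seq.reduce sumCounts group)
    |> Map.ofSeq

// Made-up rows: set "A-1" appears twice, set "B-1" once.
let sampleRows =
    [ { SetId = "A-1"; BricksCount = 4; RedCount = 1 }
      { SetId = "B-1"; BricksCount = 2; RedCount = 0 }
      { SetId = "A-1"; BricksCount = 6; RedCount = 3 } ]

let sampleLookup = buildLookup sampleRows
```

`Seq.reduce` is safe here because `Seq.groupBy` never produces an empty group. Sets with no matching rows simply have no key in the map, which is why a fallback lookup that returns zeros is useful.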
// Lookup piece count in the set, if not found, return 0
let getCountLookup k =
    match Map.tryFind k setPiecesLookup with
    | Some(x) -> x
    | _ ->
        { SetId = k;
          BricksCount = 0;
          PlatesCount = 0;
          MinifigsCount = 0;
          PanelsCount = 0;
          PlantsAndAnimalsCount = 0;
          TilesCount = 0;
          TechnicsCount = 0;
          RedCount = 0;
          GreenCount = 0;
          BlueCount = 0;
          WhiteCount = 0;
          BlackCount = 0;
          GrayCount = 0;
          SilverCount = 0;
          TranslucentCount = 0 }
// Create theme lookups from the sets files
let themesLookup =
    let distinctThemes =
        legoSets.Rows
        |> Seq.map (fun row -> row.T1)
        |> Seq.distinct

    // Pair theme string with an int
    (distinctThemes, seq [0..Seq.length distinctThemes])
    ||> Seq.zip
    |> Map
All the hard work is done. I now just take the data from the file and run it through a series of filters and transformations to transpose the brick type and piece color counts into “percent of the set” columns. Once that is done I write out a file, aggregatedata.csv, that will be used in Part 2. I also save a themes lookup file. The lookup isn’t actually needed for the decision tree processing, but it’s a nice-to-have if I want to remap the int ids back to text values for evaluation.
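The final transformation code is not shown, so the following is only a rough sketch of the "percent of the set" step and the file write. Everything in it is an assumption: the `percentOf` and `toCsvLine` helpers, the 0 to 100 scale, using the set's total piece count as the denominator, the column layout, and the temp-directory output path.

```fsharp
open System.IO

// Hypothetical helper: share of the set made up by `count` pieces (0 to 100).
// Returns 0 for an empty set to avoid dividing by zero.
let percentOf (total: int) (count: int) =
    if total = 0 then 0.0 else 100.0 * float count / float total

// Hypothetical, shortened row layout: id, year, % bricks, % red, theme id.
let toCsvLine (setId: string) (year: int) (total: int) (bricks: int) (red: int) (themeId: int) =
    sprintf "%s,%d,%.2f,%.2f,%d" setId year (percentOf total bricks) (percentOf total red) themeId

// Made-up values for one set: 200 pieces, 50 bricks, 20 red pieces.
let sampleLine = toCsvLine "A-1" 1999 200 50 20 3

// Write a header plus the sample row to a scratch file in the temp directory.
let sketchPath = Path.Combine(Path.GetTempPath(), "aggregatedata-sketch.csv")
File.WriteAllLines(sketchPath, [| "SetId,Year,BricksPct,RedPct,Theme"; sampleLine |])
```

Percentages keep a 50-piece set and a 2,000-piece set on the same scale, which is the consistency across sets that the counts-to-percent conversion is meant to provide.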