Today’s post discusses performing word stemming with F#. It expands on a previous post, Comparing Quora question intent, and as a result it also touches on some feature engineering.
For those not familiar with word stems: in this context a stem is the base of a word with its suffix removed. Stems are helpful when comparing text, especially natural-language content, which makes them a good fit for the Quora question comparisons. The Annytab.Stemmer library meets this need well.
Before getting started, note that everything here builds on the existing code from the Kaggle Quora duplicate questions post.
First, add the Annytab.Stemmer package to the project via paket.dependencies. Then open the namespace and create a stemmer object.
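The setup looks roughly like the following. This is a minimal sketch; the Annytab.Stemmer namespace and the EnglishStemmer type are my assumptions about the package, not code from the original post.

// paket.dependencies (excerpt)
nuget Annytab.Stemmer

// in the F# source file
open Annytab.Stemmer

// create a stemmer for English text
let stemmer = EnglishStemmer()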
let sentence1 = "When birds fly, they are soaring above the trees while people are watching and talking"
let sentence2 = "When birds are flying, they soar above the trees while people watch and talk"
let sentenceToWords (s:string) = s.Split([|' '|])

let sentence1Words = sentenceToWords sentence1
let sentence2Words = sentenceToWords sentence2
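The stemming step itself is not shown above, so here is a minimal sketch of running the split words through the stemmer. It assumes the stemmer object created earlier and the library's GetSteamWord method; the sentenceToStemWords name matches what the rest of the post uses, but the body is illustrative.

// split a sentence into words, then stem each word
let sentenceToStemWords (s:string) =
    s
    |> sentenceToWords
    |> Array.map (fun word -> stemmer.GetSteamWord(word))

let sentence1StemWords = sentenceToStemWords sentence1
let sentence2StemWords = sentenceToStemWords sentence2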
Looking at the results, notice that the stemmed word list contains only the bases: birds -> bird, watching -> watch, and so on. This allows concepts to be matched more reliably.
Time to update the feature generation. A valuable reminder is that feature generation is part art, part science. Often it is an iterative, experimental process. Don’t worry, intuition for what makes a good feature grows with time and experience. Using the newly defined sentenceToStemWords to extract words from the questions, a comparison can be done with Set.intersect.
let rowFeatures (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row) =
    let question1Words = sentenceToStemWords input.Question1
    let question2Words = sentenceToStemWords input.Question2
    let wordShareCount =
        Set.intersect (set question1Words) (set question2Words)
        |> Set.count
    // ... the remaining feature construction from the original post is not shown here
Adding the count of matching word stems between questions as a feature improved the accuracy by about 8%. That is a decent return for a single feature.
> evaluatePredictionResults finalModel trainInput trainOutput
Accuracy: 0.755652

T\P      T        F
T        84299    43153
F        35877    160103
> evaluatePredictionResults finalModel validationInput validationOutput
Accuracy: 0.704828

T\P      T        F
T        18281    13061
F        10806    38710
There is one downside to this approach: common words like “a”, “and”, and “the” are included in the matching word feature. This can result in a deceptively high word-match count. To get a more representative match, these “stop words” can be excluded. Time for another feature change. I built a stop words list; here is a sample. The full file is here.
i
a
about
after
all
also
an
Then alter sentenceToStemWords into a sentenceToFilteredStemWords that excludes stop words, as sketched below. This gets the features to where I want them.
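Here is a minimal sketch of that change. It assumes the stop words are loaded from a "stopwords.txt" file with one word per line; the file path and the filtering details are illustrative rather than the original implementation.

// load the stop words list into a set for fast lookup
let stopWords =
    System.IO.File.ReadAllLines("stopwords.txt")
    |> Set.ofArray

// split, drop stop words, then stem what remains
let sentenceToFilteredStemWords (s:string) =
    s
    |> sentenceToWords
    |> Array.filter (fun word -> not (stopWords.Contains(word.ToLowerInvariant())))
    |> Array.map (fun word -> stemmer.GetSteamWord(word))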
Filtering out stop words gained another 3%. Admittedly I expected a bit more, but it is still a step in the right direction.
> evaluatePredictionResults finalModel trainInput trainOutput
Accuracy: 0.777598

T\P      T        F
T        88157    39913
F        32019    163343
> evaluatePredictionResults finalModel validationInput validationOutput
Accuracy: 0.730577

T\P      T        F
T        19373    12071
F        9714     39700
As you can see, using word stems and stop word filtering to extend the features can be a useful tactic. It also serves as a good reminder that F# has the tools for this kind of analysis. I hope you found this post useful. Until next time.