2017-01-26

Simple Analysis with F#

Read Time: 12 minutes

This post is a follow-up to my previous look into Text Analytics. It provides additional examples of how data can be pulled and processed in F#. I’ll also use this as an opportunity to draw more charts. To make all this happen, I’ll be doing light analysis of the full text of Mary Shelley’s “Frankenstein”.

Like last time, if you want to follow along, you’ll first need to get a free account from Cognitive Services. Then request the API access you want; for this post it is “Text Analytics”. The API key they provide can be used in the upcoming code if you want to make your own calls. Microsoft offers a generous number of free calls, providing plenty of room to play in the environment.

Using Paket, here is a sample paket.dependencies file.

source https://nuget.org/api/v2

nuget FSharp.Charting
nuget FSharp.Core
nuget FSharp.Data
nuget Deedle
nuget Newtonsoft.Json

Again, here is the mostly boilerplate code. It’s where I load libraries, set the Cognitive Services apikey, and the url for the Frankenstein text on the Gutenberg site.

#r "../packages/FSharp.Charting/lib/net40/FSharp.Charting.dll"
#r "../packages/FSharp.Data/lib/net40/FSharp.Data.dll"
#r "../packages/Newtonsoft.Json/lib/net40/Newtonsoft.Json.dll"
#r "../packages/Deedle/lib/net40/Deedle.dll"

open System
open System.Text.RegularExpressions
open Deedle
open FSharp.Core
open FSharp.Charting
open FSharp.Data
open Newtonsoft.Json

let apiKey = "<your api key here>"

// Frankenstein
let bookUrl = "http://www.gutenberg.org/cache/epub/84/pg84.txt"

The below code is mostly a copy of the Sentiment module and supporting functions from my previous post, modified to handle multiple documents. As promised, the modification to handle multiple documents in TextToRequestJson was an easy adjustment.

module Sentiment =
    let url = "https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/sentiment"
    type Document = { language: string; id: string; text: string }
    type Request = { documents: Document list }
    type Score = { score: float; id: string }
    type ResponseError = { id: string; message: string }
    type Response = { documents: Score list; errors: ResponseError list }

open Sentiment

// Convert text documents into request body in json format
let TextToRequestJson texts =
    let documents = 
        texts
        |> Array.map (fun t ->
            { language = "en";
              id = string (fst t);
              text = snd t})

    let request = { Request.documents = (documents |> Array.toList) }
    JsonConvert.SerializeObject(request)

// Perform http call
let AnalyticsHttp url apiKey body =
    Http.RequestString(
        url,
        httpMethod = "POST",
        headers = [
            "Content-Type", "application/json";
            "Ocp-Apim-Subscription-Key", apiKey],
        body = TextRequest body)

// Wrapper so I can call it F#-style
let JsonDeserialize<'T> s = JsonConvert.DeserializeObject<'T>(s)

Here is where some special data knowledge is required. As is typically the case, it is important to know what the data looks like. Examining the text at the url, it’s nicely formatted for easy reading. The result is that there are line feeds mid-paragraph (I’ll need to remove those) and double spacing between paragraphs (I use these to do the paragraph splitting). Additionally, the webpage has header and footer information that isn’t the text of the book. The actual text is delimited by the *** START OF... and *** END OF... lines. I use them to extract the section of book text from the page. This naive method may not be perfect, but I think it’s pretty good. It is certainly good enough for my current purpose.

// Take a string and convert into list of paragraphs
// Assumption is paragraphs are split by 2 linefeeds
// Then remove mid-paragraph line breaks
let MakeParagraphs bookText =
    let paragraphBreak = new Regex(@"\r\n\s*\r\n")
    paragraphBreak.Split(bookText)
    |> Array.map (fun x -> x.Replace("\r\n", " "))

// Finds a text in an array and returns its index
let getMarkerIndex (m:string) (p:string array) =
    [0..(Array.length p) - 1]
    |> List.filter (fun i -> p.[i].StartsWith(m))
    |> List.exactlyOne

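// Extracts only book text from full text
// (the paragraphs strictly between the START and END marker paragraphs)
let ExtractBook a =
    let bookStart = getMarkerIndex "*** START OF THIS PROJECT GUTENBERG EBOOK" a
    let bookEnd = getMarkerIndex "*** END OF THIS PROJECT GUTENBERG EBOOK" a
    // Array.sub takes a start index and a count; subtract 1 so the END marker is excluded
    Array.sub a (bookStart + 1) (bookEnd - bookStart - 1)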
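
To make the splitting and extraction steps concrete, here is a minimal, self-contained sketch that runs the same idea against a tiny made-up text instead of the real download. The sample text and the names splitParagraphs and between are assumptions for illustration only; they are not part of the code above.

```fsharp
open System.Text.RegularExpressions

// Hypothetical miniature of a Gutenberg text: header, markers, two paragraphs, footer
let sample =
    [ "Header line"
      ""
      "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***"
      ""
      "First paragraph, line one"
      "and line two."
      ""
      "Second paragraph."
      ""
      "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***"
      ""
      "Footer line" ]
    |> String.concat "\r\n"

// Split on blank lines, then join the wrapped lines inside each paragraph
let splitParagraphs (text: string) =
    Regex.Split(text, @"\r\n\s*\r\n")
    |> Array.map (fun p -> p.Replace("\r\n", " "))

// Keep only the paragraphs strictly between the two marker paragraphs
let between (startMarker: string) (endMarker: string) (ps: string[]) =
    let first = ps |> Array.findIndex (fun p -> p.StartsWith(startMarker))
    let last = ps |> Array.findIndex (fun p -> p.StartsWith(endMarker))
    Array.sub ps (first + 1) (last - first - 1)

let body =
    sample
    |> splitParagraphs
    |> between "*** START OF" "*** END OF"

printfn "%A" body
```

With this sample, body holds just the two book paragraphs, and the first one has its wrapped line joined with a space.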

The below functions leverage the previously built utility functions to extract the raw text and transform it into paragraph form. Something kind of cool that’s easy to miss: Http.RequestString grabs the plain text at the url and drops it into a string. After dealing with so many verbose and heavy frameworks, it’s a joy to use something so terse. Once the text is downloaded and transformed, I perform sentiment and word count analysis of each paragraph. I will confess, it’s tempting to jam it all into a big pipeline of functions that do: url -> (sentiment, wordcount)[]. For current purposes, I prefer to have intermediate values more accessible. Also, the process is clean, but at times like these I wish F# had Haskell’s lazy infinite sequences as a built-in (see the paragraph variants below). I know F# has its ways to do this, but they’re not as clean as Haskell’s.

// Take list of id/text tuples and get sentiment by paragraph
let GetSentiment texts =
    let sentimentResponse = 
        texts
        |> TextToRequestJson
        |> (AnalyticsHttp Sentiment.url apiKey)
        |> JsonDeserialize<Sentiment.Response>

    sentimentResponse.documents
    |> List.map (fun x -> (x.id, x.score))

// Take a gutenberg url and return a book as an array of paragraphs
let GetBookParagraphs url =
    url
    |> Http.RequestString
    |> MakeParagraphs
    |> ExtractBook

// Take paragraphs and add an id for future usage.
let paragraphs =
    let temp = GetBookParagraphs bookUrl
    ([|1..Array.length temp|], temp)
    ||> Array.zip

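// I'd rather write something like this (not valid F#, so it stays commented out):
// let paragraphs' =
//     ([|1..|], GetBookParagraphs bookUrl)
//     ||> Array.zip

// If I wrote GetBookParagraphs to return a Seq, I could do this
// (ids start at 1 to match the version above; the result is a seq, not an array):
let paragraphs'' =
    (Seq.initInfinite (fun i -> i + 1), GetBookParagraphs bookUrl)
    ||> Seq.zip
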
// Get sentiment for paragraphs
let sentiment = 
    paragraphs
    |> GetSentiment

// Get wordcount for paragraphs
let wordCounts = 
    paragraphs
    |> Array.map (fun (id, p) -> 
        (id, p.Split([| ' '|]).Length)) 
    |> Array.toList
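
As a small aside on the numbering and counting steps, here is a sketch using made-up paragraphs rather than the downloaded book. The sample data and the names numbered, numberedLazy and countWords are assumptions for illustration. It shows Array.mapi as another way to attach ids, the Seq.initInfinite variant, and a word count that skips the empty entries a plain Split on a single space produces for repeated spaces or an empty paragraph.

```fsharp
open System

// Hypothetical stand-in for the array returned by GetBookParagraphs
let sampleParagraphs =
    [| "You will rejoice to hear"
       "I am already  far north of London"
       "" |]

// Array.mapi supplies the index, so no separate range is needed; ids start at 1
let numbered =
    sampleParagraphs
    |> Array.mapi (fun i p -> (i + 1, p))

// Unbounded id sequence zipped against the data; Seq.zip stops at the shorter input
let numberedLazy =
    Seq.zip (Seq.initInfinite (fun i -> i + 1)) sampleParagraphs
    |> Seq.toArray

// Word count that ignores the empty entries created by repeated spaces
let countWords (p: string) =
    p.Split([| ' ' |], StringSplitOptions.RemoveEmptyEntries).Length

let sampleCounts =
    numbered
    |> Array.map (fun (id, p) -> (id, countWords p))

printfn "%A" sampleCounts
```

Without RemoveEmptyEntries, the empty third paragraph would count as one word and the double space in the second would add an extra one, so the choice affects the minimum and the totals slightly.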

Now that I have my sentiment and wordcount lists, it’s time to do some quick analysis. Leveraging FSharp.Charting, it’s easy to put some simple reports together. A warning: the graphs are fun to look at, but there aren’t grand insights into the book. This is just a fun exercise. As a reminder, sentiment is on a scale of 0 to 1, where 0 is very negative and 1 is very positive.

// Sentiment through book
Chart.Column(sentiment)
|> Chart.WithXAxis(Enabled=true, Title="Paragraph number")
|> Chart.WithYAxis(Enabled=true, Title="Sentiment")
|> Chart.WithTitle("Paragraph Sentiment")
|> Chart.Show

This data is almost too noisy to be useful.

Paragraph Sentiment

// Paragraph wordcount through book
Chart.Column(wordCounts)
|> Chart.WithXAxis(Enabled=true, Title="Paragraph number")
|> Chart.WithYAxis(Enabled=true, Title="Word Count")
|> Chart.WithTitle("Words per Paragraph")
|> Chart.Show

This chart shows a couple of crazy long paragraph outliers.

Paragraph WordCount

// Histogram of paragraph sentiments
Chart.Histogram(sentiment |> List.map snd, Intervals=20.)
|> Chart.WithXAxis(Enabled=true, Title="Sentiment")
|> Chart.WithYAxis(Enabled=true, Title="Paragraph Count")
|> Chart.WithTitle("Sentiment Histogram")
|> Chart.Show

Here we see a better breakdown of paragraph sentiment.

Sentiment Histogram

I configure the histogram to use 20 buckets. With some trial and error, this seemed like a good balance.
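
To illustrate what 20 buckets means for scores between 0 and 1, here is a rough sketch of equal-width bucketing done by hand. The sample scores and the names bucketOf and histogram are made up for this example, and it is not a claim about how FSharp.Charting computes its intervals internally.

```fsharp
// Hypothetical sentiment scores in the 0..1 range
let sampleScores = [ 0.02; 0.04; 0.51; 0.53; 0.97; 1.0 ]

let bucketCount = 20

// Map a score to a bucket index 0..19; a score of exactly 1.0 is clamped into the last bucket
let bucketOf (score: float) =
    min (bucketCount - 1) (int (score * float bucketCount))

// Count how many scores land in each bucket, ordered by bucket index
let histogram =
    sampleScores
    |> List.countBy bucketOf
    |> List.sortBy fst

printfn "%A" histogram
```

Each bucket here covers a 0.05 slice of the scale, so fewer buckets would hide detail and more buckets would leave many of them nearly empty.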

Note: Partially applied infix operators. For better or worse, I prefer avoiding the extra lambda syntax. If I can get away from it without obscuring intent, I do. Here is a place where I partially apply >, saving me like 13 precious keystrokes: |> List.filter ((>)500). Because (>) 500 fixes the left operand, it reads as 500 > x and keeps only the counts under 500. It’s a fun trick; I think it’s more readable, but it could also be my Perl golfing tendencies emerging.
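
Since the operand order is easy to get backwards, here is a tiny sketch comparing the partially applied operator with the explicit lambda. The sample numbers and names are made up for illustration.

```fsharp
// Hypothetical word counts
let sampleWordCounts = [ 120; 499; 500; 884 ]

// (>) 500 fixes the LEFT operand, so it means: fun x -> 500 > x
let viaOperator = sampleWordCounts |> List.filter ((>) 500)

// The equivalent explicit lambda
let viaLambda = sampleWordCounts |> List.filter (fun x -> x < 500)

printfn "%A %A" viaOperator viaLambda
```

Both lists contain only 120 and 499; the value 500 itself is dropped because 500 > 500 is false.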

// Wordcount histogram (keep only paragraphs under 500 words)
Chart.Histogram(
    wordCounts 
    |> List.map snd 
    |> List.filter ((>)500),
    Intervals=20.)
|> Chart.WithXAxis(Enabled=true, Title="WordCount")
|> Chart.WithYAxis(Enabled=true, Title="Paragraph Count")
|> Chart.WithTitle("WordCount Histogram")
|> Chart.Show

Most paragraphs are under 200 words, with the lion’s share being less than 100 words.

Wordcount Histogram

// Wordcount/Sentiment graph
(wordCounts    |> List.map snd, 
 sentiment     |> List.map snd)
||> List.zip
|> Chart.Point
|> Chart.WithXAxis(Enabled=true, Title="Words per Paragraph")
|> Chart.WithYAxis(Enabled=true, Title="Sentiment")
|> Chart.WithTitle("WordCount and Sentiment")
|> Chart.Show

Here we see if there is any trend with respect to paragraph wordcount and sentiment. I don’t really see one.

Wordcount and Sentiment

This post is primarily about using stock F# and watching the data flow. With that said, I would be remiss if I didn’t mention Deedle. To do serious data analysis, Deedle is a powerful tool. Its ability to manage and manipulate dataframes and series is extremely useful. The below tidbits don’t do the library justice, but they do provide a small taste of what can be accomplished easily.

Below I convert the data into a series, allowing more advanced reporting. Then I generate a moving average, using 30 paragraphs as the window. As seen above, the raw data was interesting but, I believe, not overly insightful. A moving average helps to soften peaks and display trends better.

// Turn sentiment into a Deedle series
let sentimentSeries = sentiment |> series

// Chart a moving average
Stats.movingMean 30 sentimentSeries
|> Series.observations
|> Chart.Line
|> Chart.WithXAxis(Enabled=true, Title="Paragraph Number")
|> Chart.WithYAxis(Enabled=true, Title="Sentiment")
|> Chart.WithTitle("Sentiment Moving Average")
|> Chart.Show

This shows an easier-to-read sentiment trend. The periods of the book that use darker tones are easier to see now.

Sentiment Moving Average
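
For anyone curious what a moving average does to the numbers, here is a minimal sketch in stock F# using Seq.windowed with a window of 3 on made-up scores. The names and data are assumptions for illustration. Deedle’s Stats.movingMean may handle the leading, incomplete windows differently, so treat this as a picture of the idea rather than a drop-in equivalent.

```fsharp
// Hypothetical sentiment scores, one per paragraph
let rawScores = [ 1.0; 0.0; 1.0; 0.0; 0.5; 0.5 ]

// Mean of every full window of `size` consecutive values
let movingAverage size (values: float list) =
    values
    |> Seq.windowed size
    |> Seq.map Array.average
    |> Seq.toList

let smoothed = movingAverage 3 rawScores

printfn "%A" smoothed
```

The raw scores swing between 0 and 1, while the smoothed values stay between roughly 0.33 and 0.67, which is the softening effect visible in the chart.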

Series provide a mechanism for basic series statistics. There is no need to calculate these yourself. This is not the full range of functionality; again, it is just a view into the type of calculations readily available.

// Sentiment stats
printfn "Mean: %.2f Median: %.2f Min: %.2f Max: %.2f StdDev: %.2f Variance: %.2f" 
    (sentimentSeries.Mean())
    (sentimentSeries.Median())
    (sentimentSeries.Min())
    (sentimentSeries.Max())
    (Stats.stdDev sentimentSeries)
    (Stats.variance sentimentSeries)

Output:
Mean: 0.58 Median: 0.67 Min: 0.00 Max: 1.00 StdDev: 0.36 Variance: 0.13


// WordCount Stats
let wordCountsSeries = wordCounts |> List.map (fun (a,b) -> (a, float b)) |> series

printfn "Mean: %.2f Median: %.2f Min: %.2f Max: %.2f StdDev: %.2f Variance: %.2f" 
    (wordCountsSeries.Mean())
    (wordCountsSeries.Median())
    (wordCountsSeries.Min())
    (wordCountsSeries.Max())
    (Stats.stdDev wordCountsSeries)
    (Stats.variance wordCountsSeries)

Output:
Mean: 106.98 Median: 85.50 Min: 1.00 Max: 884.00 StdDev: 94.90 Variance: 9006.50
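
As a reference for how to read these numbers, here is a hand-rolled sketch of the same statistics in stock F# on made-up data. The sample values and names are assumptions for illustration, and I am assuming the sample (n - 1) form of variance; I have not verified which form Deedle uses. Either way, variance is the square of the standard deviation, which is consistent with the outputs above.

```fsharp
// Hypothetical word counts
let sampleValues = [ 2.0; 4.0; 4.0; 4.0; 5.0; 5.0; 7.0; 9.0 ]

let mean = List.average sampleValues

// Middle value of the sorted data (average of the two middle values for even counts)
let median =
    let sorted = sampleValues |> List.sort |> List.toArray
    let n = sorted.Length
    if n % 2 = 1 then sorted.[n / 2]
    else (sorted.[n / 2 - 1] + sorted.[n / 2]) / 2.0

// Sample variance: squared deviations from the mean, divided by n - 1 (an assumption, see above)
let variance =
    let n = float sampleValues.Length
    (sampleValues |> List.sumBy (fun x -> (x - mean) ** 2.0)) / (n - 1.0)

let stdDev = sqrt variance

printfn "Mean: %.2f Median: %.2f StdDev: %.2f Variance: %.2f" mean median stdDev variance
```

A median well below the mean, as in the word count output above, is a hint that a few very long paragraphs are pulling the mean upward.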

I hope you enjoyed this slightly deeper examination into sentiment analysis and F#.