2019-12-21

Examining Entropy with F#

Read Time: 10 minutes

Measuring entropy can be a useful heuristic when performing analysis. This is a short introduction into performing entropy analysis on binary and text files using F#.

What can entropy based analysis be used for? One specific use case is searching for secret keys in files. In particular, scanning for hardcoded secrets or passwords that are accidently committed to git. Scanning tools can be used to catch these kinds of issues. Another use case is scanning binaries for reverse engineering purposes. It is an interesting exercise to see what goes on behind the scenes with these types of tools. The ability to scan successfully is predicated on the keys being blocks of random characters (a fair assumption for good passwords). The randomness can be extracted from code files that have a certain level of predictability. At the core of this process is Information Theory, namely Shannon Entropy. If you want to follow along, you’ll need to get .NET Core installed. For visualizations, XPlot will provide the necessary functionality.

1
2
3

dotnet new console -lang F# -n Entropy
cd Entropy
dotnet add package XPlot.GoogleCharts --version 3.0.1

1
2
3

open System
open System.IO
open XPlot.GoogleCharts

$$ {H = - \sum^N_{i=1}p_i \log p_i} $$

The above equation is our starting point, where p_i is the probability of seeing a particular character. Implementation of the Shannon Entropy calculation is deceptively simple. First, take a byte array and count the frequency of each byte value. Second, sum the probabilities of seeing each byte in the byte array times the Log of the respective probability. Here is where I diverge from the typical implementation. Discussion often focuses on Log₂, since so many instances focus on bits. I prefer to work with a normalized value. For 256 possible values I use Log₂₅₆ to scale between 0 and 1, where 0 is no entropy and 1 is high entropy.

let NormalizedEntropy (data: byte[]) =
  let mutable frequency = Array.create 256 0

  // Populate byte frequencies
  for i in 0 .. (data.Length - 1) do 
    let newByte = int data.[i]
    frequency.[newByte] <- frequency.[newByte] + 1

  // Calculate entropy based on character frequencies
  let mutable entropy = 0.
  for i in 0 .. 255 do
    let probX = (float frequency.[i]) / (float data.Length)
    if probX > 0. then entropy <- entropy - (probX * Math.Log(probX, 256.))
  
  entropy

Since one of the possible uses of entropy analysis is checking files for secrets, a simple example is looking at Program.fs. For comparison I copied the file and added a couple random passwords. Taking it a step further, I generated a fake.pem just to demonstrate how the entropy of code differs from pem files (something else that might want to be scanned for in code repos).

[ "Program.fs" 
  "Program-Secret.fs" 
  "fake.pem" ]
  |> List.iter (fun filename ->
    let bytes = File.ReadAllBytes(filename)
    let e = NormalizedEntropy bytes
    printfn "%20s - %3.2f" filename e)

As the results show below, putting some randomized strings (like secret keys or passwords) in a code file increases it’s entropy. For comparison, the pem file is noticeably higher. Although this is interesting, it has its limitations for any type of real scanning. Perhaps I could scan all my code files to see if there is a typical range and use that to find anomalies, but I have a better idea.

1
2
3

       Program.fs - 0.60
Program-Secret.fs - 0.63
         fake.pem - 0.75

Details can be lost by looking at a file as a whole. Another approach is to using a sliding window through the file. Additionally leverage a threshold to show any blocks of the the file that have a high entropy. A little refactoring for readability, the entropy calculation is broken out. Then a sliding window is applied against the file. Whenever a window is above the entropy threshold, show that block (at least the printable characters). To cut down on excessive printing, don’t show overlapping windows.

/// Given a frequency array, calculate normalized entropy for a specified window size
let windowNormalizedEntropy (windowSize: int) (frequency: int[])=
  let mutable entropy = 0.
  for i in 0 .. 255 do
    let probX = (float frequency.[i]) / (float windowSize)
    if probX > 0. then entropy <- entropy - (probX * Math.Log(probX, 256.))
  entropy

/// Normalized entropy using a rolling window
let NormalizedEntropyWindowShow (windowSize: int) (threshold: float) (data: byte[]) =
  let mutable frequency = Array.create 256 0
  let mutable lastPrintedPosition = -1

  for position in 0 .. (data.Length - windowSize - 1) do
    // Add new byte to window
    let newByte = int data.[position + windowSize]
    frequency.[newByte] <- frequency.[newByte] + 1
    
    if position >= windowSize then 
      // Roll off old byte from window 
      let currentByte = int data.[position]
      frequency.[currentByte] <- frequency.[currentByte] - 1

      // Calculate entropy 
      let entropy = windowNormalizedEntropy windowSize frequency

      // Show high entropy block (if I haven't shown part already)
      if entropy >= threshold && (position > lastPrintedPosition) then
        lastPrintedPosition <- position + windowSize - 1
        let nonPrintable = RegularExpressions.Regex("[^ -~]")
        let slice = data.[position..(position + windowSize - 1)]
        let s = nonPrintable.Replace(Encoding.ASCII.GetString (slice), "?")
        printfn "%6d %6.5f %s" position entropy s

The file can then be evaluated by providing a window size and threshold. A window size of 50 bytes and threshold of 0.58 are mostly arbitrary, but some experimentation show these seem to get a reasonable result. I can now run the analysis against a code file with secrets and the pem file.

[ "Program-Secret.fs" 
  "fake.pem" ]
|> List.iter (fun filename ->
  printfn "File: %s" filename
  let bytes = File.ReadAllBytes(filename)
  NormalizedEntropyWindowShow 50 0.58 bytes)

The results provide a bit more insight than before. The entire pem file is captured (as expected). For the code file, not all things caught are secrets. Natural codeblocks get caught up in the process, and that’s fine. This is about finding possibly areas of concern, and that goal is accomplished. This is useful approach, but not perfect. There are still gaps that can be closed to make it a more robust solution.

File: Program-Secret.fs
   151 0.58482 "?  |> printfn "%s"??// let keyToFind = "ABCKEDIEM
   201 0.63482 WPFHEIK1238UIMQ890PoqEgfh"??/// Normalized entropy
  1070 0.58105 bX * Math.Log(probX, 256.))?  entropy??/// Normali
  1850 0.58105 rintable = Text.RegularExpressions.Regex("[^ -~]")
  2830 0.58391 ?[<EntryPoint>]?let main argv =?  // 94kmnhdnbi*IE
  2880 0.66548 JDMSKQL098123456hjqwemn,.<>/;':',.mnbvcxzasdfghjkl
  2930 0.70548 ;'][poiuytrewq1234567890-=+_)(*&^%$#@!QWERTYUIOP{}
  2980 0.63635 |":LKJHGFDSAZXCVBNM<>?*))]?  ?    // (Text.Encodin
File: fake.pem
    50 0.61793 04w4NMK0Ak5XVR1XR/TX/YGcj6z91Zd8j3K+0Q8SZKa9n8?QIu
   100 0.62482 0URxz/CT5I4SeS746vd1bPpcSXy0rHvLMz0UxtIxlEOuCHAoed
   150 0.60982 eDzKv4zr+MJ?/FCrKRazzAFv2/klBi3bRfW4UseYPeVs43ahRF
   200 0.62293 O/gugc11n8drzlQq3BqkwJCTAx?uj+ZqG5QI6C8x8uooOvs0xk
   ...
  1450 0.63048 xjaxX+GIqEK?TM0T8QKBgCN83f96A39Qf5WGlJuY5EAsLWi+eu
  1500 0.63048 0BO/We7heP1Op74XidcWucKaRb?t0+cNjkcnYgziBPj8TxAzMh
  1550 0.63671 ueHJIB+TRAOqxQXPuYj+15O2PudrgMrKkRFC0lG90?3rOiKJCY
  1600 0.63146 4g578SCKRNF0/xiMlXe9+DzfBCStzFirsI2Bq7tcrTmw?-----

One question that remains is what are ideal values for window size and threshold that maximize finding secrets while minimizing false positives. One of the impacting factors for this is different languages have their own characteristics, impacting the best parameters. Beyond static parameters, statistical methods can be used to determine anomalies and parameters for a file, providing a more dynamic approach. In the spirit of investigating different angles, I ran analysis against some of my repos (using a window size of 50). My code generally has a range, but it appears some scanning refinement can be accomplished by taking file type into account.

.c  : 0.530
.cl : 0.477
.cpp: 0.490
.cs : 0.486
.fs : 0.502
.fsx: 0.500
.hs : 0.513
.js : 0.492
.pl : 0.540
.ps1: 0.503
.py : 0.490
.r  : 0.516
.rkt: 0.475
.sh : 0.540
.ts : 0.520

Because there are multiple ways to look at the data, it is time to go a bit more visual. The below adaptation returns a by-sliding-window entropy array. This array is then fed into a charting function. A chart provides a nice visual representation of the entropy over the entire file.

/// Normalized entropy using a rolling window
let NormalizedEntropyWindow (windowSize: int) (data: byte[]) =
  let mutable frequency = Array.create 256 0

  [| 
    for position in 0 .. (data.Length - windowSize - 1) do
      // Add new byte to window
      let newByte = int data.[position + windowSize]
      frequency.[newByte] <- frequency.[newByte] + 1
      
      if position >= windowSize then 
        // Roll off old byte off window
        let currentByte = int data.[position]
        frequency.[currentByte] <- frequency.[currentByte] - 1

        // Calculate entropy 
        yield (windowNormalizedEntropy windowSize frequency)
  |]

[ "Program-Secret.fs" 
  "fake.pem" ]
|> List.iter (fun filename ->
  printfn "File: %s" filename
  let bytes = File.ReadAllBytes(filename)
  
  NormalizedEntropyWindow 100 bytes
  |> Chart.Line
  |> Chart.Show)

Now, as a result, I get charts showing entropy over the respective file. The bumps show where I placed some random string keys in the file. Just for kicks, I including a block of “AAAA…”, which is also evident when entropy drops to 0. As a contrast, the pem file has a pretty consistently higher entropy throughout the whole file. As with so many things, different angles on the problem help to improve intuition. Plus I’m a sucker for a good picture.

Program-Secret.fs

fake.pem

This has been a brief look into how entropy patterns can be of use when scanning files for interesting information. Until next time.