Benchmarking an Old Chestnut

Testing linear vs. quadratic time-complexity to reluctantly prove myself not wrong.

This article will discuss solutions to a specific Code Wars kata (but the type of question is fairly common). If you’d prefer not to have the solution spoiled and want to try it yourself first, here’s the puzzle.

Towards the end of my bootcamp, when everyone was crunching on their final projects and really starting to grapple with the symptoms of adrenaline-occluded sleep deprivation, Lighthouse Labs shifted their mandatory morning lecture series over from material that contextualized our coursework to the stuff our mentors were actually interested in talking about: Elm and functional programming, content delivery network optimization and database sharding, cryptography, time-complexity optimization—comparatively abstract topics that would have been among the most engaging of the entire curriculum if I hadn’t been helplessly sleeping through most of them.

Last week, I was pleasantly surprised to discover that a bit of the optimization lecture had somehow permeated and persisted. At Tim’s recommendation, I was having a stab at solving a Code Wars kata while trying to coherently verbalize my thought-process (my perennial tech-interview bugaboo). We were working on this one.

(quoted from https://www.codewars.com/kata/54da5a58ea159efa38000836)

Given an array of integers, find the one that appears an odd number of times.

There will always be only one integer that appears an odd number of times.
Examples

[7] should return 7, because it occurs 1 time (which is odd).
[0] should return 0, because it occurs 1 time (which is odd).
[1,1,2] should return 2, because it occurs 1 time (which is odd).
[0,1,0,1,0] should return 0, because it occurs 3 times (which is odd).
[1,2,2,3,3,3,4,3,3,3,2,2,1] should return 4, because it appears 1 time (which is odd).

Competing half-formed strategies swarmed me, most of which were smothered by self-doubt and the fear that I was embarrassingly selecting (and at risk of publicly promoting) The Naïve Approach. Finally, the mortification of my dead air won out and I took a stab, couching it with something sufficiently weaselly like “okay, I think, maybe, to start, a naïve approach could be…”. In the debrief afterwards, it turned out that my approach was optimal (suspicious); the more conventional approach hadn’t even occurred to me, as it had back when I’d been sitting (slumping) through the Lighthouse Labs lecture, suggesting I’d internalized better practices successfully and without my knowledge. Hooray…?

Anyhoo, if I understand how to do this, it’s got to be the first thing anyone learns about the subject whenever they’re exposed to it. But, if you’re interested, I’d love to walk you through the different conventional solutions for this puzzle as well as the benchmarking process I implemented just to see how useful my unexpected competence could be.

The Naïve Approach

Right: we want to find the value in an array that occurs an odd number of times (with the kind proviso that exactly one value in any provided input will ever occur an odd number of times). That means we’ll want to iterate over the array at least once to check the elements:

def find_the_oddly_populated_value(test_array)
  test_array.each do |value|
    # Something good.
  end
end

When we encounter a value, we then need to figure out how many times it occurs in the array so we can check whether that number is odd or even. So, at this point, we could count how many times the current value comes up, check to see if that number is odd, and then report our success or move on to the next element:

def find_and_count(test_array)
  test_array.find { |value| test_array.count(value).odd? }
end

Idiomatic Ruby makes this look like a clean, snappy process that reads like human language, but it obscures just how expensive an operation we’re actually performing. Ultimately, #find and #count each iterate over the array, so the following code is roughly equivalent:

def find_and_count(test_array)
  test_array.each do |current_value_from_first_pass|
    hits = 0

    test_array.each do |possible_equivalent_value_from_second_pass|
      if current_value_from_first_pass == possible_equivalent_value_from_second_pass
        hits += 1
      end
    end

    return current_value_from_first_pass if hits.odd?
  end
end

For each element in the array, we’re scanning the whole array an extra time. An array with three elements, [ 1, 1, 2 ], would have us check three more elements for each of those three: nine checks in total (or twelve, if you count the initial iteration). If the array had eleven elements…

[ 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6 ]

… We’d be making eleven extra passes, checking 121 elements in total (or 132, counting the initial iteration). The amount of time needed, while very very tiny on modern systems for small arrays, grows quadratically as the length of the array increases. For arrays with millions of elements, or where the operation at each step is more taxing than just checking equality¹, this becomes a problem. Computer science borrows from math to describe this property of “time complexity” using “big O notation”; in our case, since the time to complete the operation increases as the square of the size of the input, we’d call it “quadratic time complexity” and write it as O(n²).
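
If you’d like to watch that arithmetic get out of hand, here’s a throwaway sketch (names mine, not part of the kata) that just computes how many equality checks the nested version performs at a few sizes:

# Each of the n elements triggers one full n-element counting pass,
# so the nested approach performs n * n equality checks.
def quadratic_check_count(n)
  n * n
end

[3, 11, 1_000, 1_000_000].each do |n|
  puts "#{n} elements: #{quadratic_check_count(n)} checks"
end
#=> 3 elements: 9 checks
#=> 11 elements: 121 checks
#=> 1000 elements: 1000000 checks
#=> 1000000 elements: 1000000000000 checks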

My Mind-Bendingly Genius Approach

The solution I came up with² was to build a second data structure into which I could optimally reorganize test_array.

def tally_and_check(test_array)
  # Step 1
  tallies = Hash.new(0)

  # Step 2
  test_array.each do |value|
    tallies[value] += 1
  end

  # Step 3
  tallies.each do |original_value_from_the_test_array, frequency|
    if frequency.odd?
      return original_value_from_the_test_array
    end
  end
end

  • Step 1: I created a tallies hash (more generally, a dictionary) with a default value of 0 for undefined keys (in Ruby, Hash.new(0)). tallies’s keys in this case would be the distinct values in the test array, and tallies’s values would be the frequency with which they were encountered.
  • Step 2: I iterated over test_array once, incrementing tallies at the key corresponding with the current value. In the end, I might wind up with a hash that looked like this:
    {
      smurfs: 6,
      hobbits: 2,
      dwarves: 8,
      nac_mac_feegle: 19
    }
    
  • Step 3: Finally, I iterated over the tallies hash, checking each value (frequency) for oddness and, upon finding one, returning the key (original_value_from_the_test_array) as the output of the function (and my final answer).

When I first learned this, I remember despairing (sleepily) that a whole bunch of apparent and unintuitive complexity was being introduced to solve the problem. The key benefit, though, is that I’m iterating over test_array only once, and then iterating over tallies only once. Both operations have linear time complexity (written O(n)), so the total time taken grows in direct proportion to the length of the input (which is fair) rather than with its square (which is catastrophic).³
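
(An aside: Ruby 2.7 and newer ship Enumerable#tally, which performs steps 1 and 2 in a single call. A condensed sketch of the same approach, method name mine:)

def tally_and_check_condensed(test_array)
  # #tally builds the same { value => frequency } hash as steps 1 and 2;
  # #find on a hash yields [key, value] pairs, and &.first extracts the key.
  test_array.tally.find { |_value, frequency| frequency.odd? }&.first
end

Same two linear passes under the hood; just less of my own plumbing.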

But, really…?

Top solutions on Code Wars for this kata resembled the naïve approach, so even though Tim had given me his blessing, I wanted to see if maybe I could still be wrong somehow:

  • It seemed possible that Ruby’s #find or #count methods might involve some fancy optimizations under the hood that would get me better than expected results.
  • Additionally, the examples I gave in my discussion of the naïve approach are all worst-case scenarios, where an odd-frequency value is only ever encountered at the end of test_array. What if it’s the first element? Isn’t it an awful lot of work populating that tallies hash unnecessarily?
  • Additionally additionally, it seems like we could cut down on a bunch of unnecessary iterations of test_array if we only count a given value once (though see the caveat after this list):

    def find_method_that_logs_failures(test_array)
      evens = [ ]
    
      test_array.find do |value|
        # Move on if this value has already been checked.
        next if evens.include?(value)
    
        if test_array.count(value).odd?
          # Hooray, you found it!  Return true so that #find returns this value.
          true
    
        else
          # Add the value to the `evens` array so we don't check it again.
          evens << value
    
          # Return false so that #find continues working through `test_array`
          false
        end
      end
    end
    

    Or, here’s a much more elegant implementation I saw:

    def find_and_count_unique(test_array)
      # Reducing `test_array` to only its unique values with `#uniq` *also*
      # avoids redundant searches.
      test_array.uniq.find { |value| test_array.count(value).odd? }
    end
    
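One wrinkle worth flagging in my failure-logging version above: Array#include? is itself a linear scan, so the growing evens array quietly re-introduces extra work. Swapping it for a Set, whose membership test is a hash lookup, sidesteps that. A sketch, method name mine:

require 'set'

def find_and_log_failures_with_set(test_array)
  evens = Set.new  # Set#include? is a hash lookup, not a linear scan

  test_array.find do |value|
    # Move on if this value has already been counted and found even.
    next false if evens.include?(value)

    if test_array.count(value).odd?
      true
    else
      evens << value
      false
    end
  end
end

(#find_and_count_unique dodges the same problem differently, by deduplicating up front at the cost of one extra pass and a new array.)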

Benchmark Bumbum

Ruby’s Benchmark module has a pretty easy interface to work with, including its Benchmark.bmbm method, which runs each benchmark twice: a rehearsal pass whose results you can throw away⁴, then the real pass that gets reported (times are in seconds, with “real” being wall-clock time). I started by creating my testing array:

example_array = (1..20).map do |number|
  Array.new(1_000_000, number)
end.flatten.push(rand(1..20)).shuffle

This builds me a shuffled array of numbers between 1 and 20, a million of each, plus one more of one of those numbers to ensure something is oddly-frequent. It’s a big array.

puts example_array.length #=> 20_000_001
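
(A stylistic aside, not a performance fix: Array#flat_map would build the same array without holding the intermediate array-of-arrays, if the map-then-flatten bothers you.)

example_array = (1..20)
                .flat_map { |number| Array.new(1_000_000, number) }
                .push(rand(1..20))
                .shuffle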

This represents a general case where the oddly-frequent value could be anywhere. I ran this benchmark against the naïvest #find_and_count, the improved #find_and_count_unique, and my #tally_and_check, with the following results:

require 'benchmark'

Benchmark.bmbm do |approach|
  approach.report('Find and Count') { find_and_count(example_array) }
  approach.report('Find and Count Unique') { find_and_count_unique(example_array) }
  approach.report('Tally and Check') { tally_and_check(example_array) }
end
                            user     system      total        real
Find and Count          0.660249   0.003853   0.664102 (  0.667515)
Find and Count Unique   1.348574   0.148201   1.496775 (  1.506050)
Tally and Check         2.183379   0.000000   2.183379 (  2.193576)

I knew it!

Gratifyingly, this confirmed all my fears that I could still be wrong, and therefore all was right with the world. Multiple runs tended to produce similar results, though the margin by which the #find_and_count approaches beat my #tally_and_check varied pretty wildly. When I raised this with Tim, we agreed that the variation might have to do with encountering the oddly-frequent value earlier or later in the process (i.e., closer to a best-case or worst-case scenario). He suggested testing those best- and worst-case scenarios specifically:

best_case_example_array = (1..20).map do |number|
  Array.new(1_000_000, number)
end.flatten.prepend(1)

worst_case_example_array = (1..20).map do |number|
  Array.new(1_000_000, number)
end.flatten.push(20)

Both arrays are still twenty million (and one) elements long; the best-case scenario is that the first value tested yields the hit, and the worst-case scenario is that the winning value is encountered as late as possible in the array (the first 20 doesn’t appear until index [19_000_000], here).
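
Those positions are cheap to sanity-check before betting hours of CPU time on them (a quick console check of my own arithmetic, not part of the benchmark):

puts best_case_example_array.first          #=> 1 (a million and one 1s: odd)
puts worst_case_example_array.index(20)     #=> 19000000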

Best-Case Scenario

Benchmark.bmbm do |approach|
  approach.report('Find and Count') { find_and_count(best_case_example_array) }
  approach.report('Find and Count Unique') { find_and_count_unique(best_case_example_array) }
  approach.report('Tally and Check') { tally_and_check(best_case_example_array) }
end
                            user     system      total        real
Find and Count          1.142833   0.000000   1.142833 (  1.146753)
Find and Count Unique   1.398049   0.140335   1.538384 (  1.542506)
Tally and Check         1.995043   0.000000   1.995043 (  2.000064)

This confirmed the pattern, with little variation. My guess would be that #find_and_count_unique’s extra time comes from the additional pass #uniq makes over the array to throw out the duplicate values.

On to the worst-case scenario.

Worst-Case Scenario

Benchmark.bmbm do |approach|
  approach.report('Find and Count') { find_and_count(worst_case_example_array) }
  # ...
end
                            user     system      total        real
Find and Count

[Done] exited with code=null in 8160.226 seconds

This sure did not finish. The streetlights dimmed, the fans spun, some magic smoke leaked out, and more than two hours passed; while it was amusing to watch my CPU bounce the task from core to core like a hot potato, I eventually called it: #find_and_count could not deal with the worst-case scenario on my machine for an array of this size. I tried the rest:
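
For the morbidly curious, here’s the back-of-the-envelope arithmetic (mine) on why it never lands:

# Roughly 19 million elements precede the first 20, and each one triggers
# a #count that scans the entire 20_000_001-element array.
elements_counted_before_a_hit = 19_000_000
cost_of_each_count            = 20_000_001

puts elements_counted_before_a_hit * cost_of_each_count
#=> 380000019000000 (on the order of 4 * 10^14 comparisons)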

Benchmark.bmbm do |approach|
  # approach.report('Find and Count') { find_and_count(worst_case_example_array) }
  approach.report('Find and Count Unique') { find_and_count_unique(worst_case_example_array) }
  approach.report('Tally and Check') { tally_and_check(worst_case_example_array) }
end
                            user     system      total        real
Find and Count Unique   2.166757   0.131237   2.297994 (  2.304235)
Tally and Check         1.771231   0.000009   1.771240 (  1.775164)

#tally_and_check remains steady, as expected, while the optimized #find_and_count_unique’s run of two seconds and change is substantially better than #find_and_count’s never-ever. Turns out it’s faster to iterate over a twenty-million-element array twenty times than it is to iterate over it nineteen million times. Thanks, computer science!

Moar Numbers

Tim also pointed out that extending the possible range of values the array could hold might be more realistic and illustrative of the benefits of #tally_and_check, so I introduced a new version of each of the existing example arrays:

tims_example_array = (1..2000).map do |number|
  Array.new(10_000, number)
end.flatten.push(rand(1..2000)).shuffle

tims_best_case_example_array = (1..2000).map do |number|
  Array.new(10_000, number)
end.flatten.prepend(1)

tims_worst_case_example_array = (1..2000).map do |number|
  Array.new(10_000, number)
end.flatten.push(2000)

These arrays remain twenty million elements long but now have an even distribution of two thousand values, rather than twenty. My approach for describing best- and worst-case scenarios also remains consistent.

Best-Case Scenario Again

Benchmark.bmbm do |approach|
  approach.report('Find and Count') { find_and_count(tims_best_case_example_array) }
  approach.report('Find and Count Unique') { find_and_count_unique(tims_best_case_example_array) }
  approach.report('Tally and Check') { tally_and_check(tims_best_case_example_array) }
end
                            user     system      total        real
Find and Count          0.105408   0.000002   0.105410 (  0.105981)
Find and Count Unique   0.652062   0.134653   0.786715 (  0.793576)
Tally and Check         2.552200   0.000000   2.552200 (  2.573192)

I see the same pattern here that I did in the previous best-case scenario, just more extreme. My first guess was that, rather than needing to count a million and one 1s before being sure we’d found the oddly-frequent value, the search only needed to get to ten thousand and one; but #count scans the whole array no matter how early its matches turn up, so the gap has to come from somewhere else. … Maybe the previous worst-case scenario fused my CPU into an unexpectedly superior configuration somehow. Probably not.

But, speaking of worst-case scenarios…

Worst-Case Scenario Again

I’m not bothering to try this on #find_and_count again. I can only afford so much magic smoke.

Benchmark.bmbm do |approach|
  # approach.report('Find and Count') { find_and_count(tims_worst_case_example_array) }  # NOPE.
  approach.report('Find and Count Unique') { find_and_count_unique(tims_worst_case_example_array) }
  approach.report('Tally and Check') { tally_and_check(tims_worst_case_example_array) }
end
                            user     system      total        real
Find and Count Unique 174.092475   0.138251 174.230726 (174.687193)
Tally and Check         1.859082   0.000001   1.859083 (  1.863179)

There we go: nearly three minutes versus #tally_and_check’s consistent two seconds. It makes sense to me: going from twenty distinct values to two thousand gives #uniq a hundred times as many counting passes to make, which tracks with roughly 2.3 seconds ballooning to roughly 175. As the number of possible values approaches the overall length of the array, the benefit of compressing the array using #uniq to avoid redundant searches approaches zero (and then becomes a liability); the slowness of #find_and_count_unique approaches the slowness of the naïviëst #find_and_count and melts my computer. Neat.

Fine, then.

After yet another bracing foray into the realm of computer-processed pointlessness, I’m prepared to admit that maybe I might be right in suspecting that O(n) is faster than O(n²). Thanks again for the fun, Tim.

… Of course, there’s a better way

  1. Imagine if we needed to make an API request each time, geez. ↩︎

  2. That is, the solution I forgot I remembered. ↩︎

  3. Tim points out that accessing the value at a given location in either an array/list or a hash/dictionary takes “constant time” (O(1) in big O notation) no matter how large the structure. In this case, though, since we don’t yet know which key holds an odd value, we still need to iterate over tallies, which in the worst case means visiting every key: an O(n) operation. The two passes in #tally_and_check technically amount to O(2n), but the convention is to drop the constant coefficient and just call it O(n). ↩︎

  4. Handy if it takes your machine a little while to throttle up (or throttle out). ↩︎