Tuesday, December 27, 2016

Big Numbers

The human brain is ill-equipped to deal with large numbers. Drawing attention to large numbers is only partially effective. For instance, the following ticker in NYC displays the current US National Debt:

The US National Debt clock at Union Square, NYC

That number is so big it becomes abstract and meaningless. In fact, the above picture is already from a while back. By the time you read this, the actual US National Debt is closer to $20 trillion, a number so large it becomes hard to parse, especially as the image above is missing commas.

Maybe it helps if we write down the current amount, as of writing of this text, in words and compare it with another large country:

Country | Debt in Dollars | Debt in Words/Speech
USA | $19,859,586,951,270 | Nineteen trillion, eight hundred fifty-nine billion, five hundred eighty-six million, nine hundred fifty-one thousand, two hundred seventy dollars.
Russia | $152,205,694,374 | One hundred fifty-two billion, two hundred five million, six hundred ninety-four thousand, three hundred seventy-four dollars.

If we round off the US debt to $20 trillion, the Russian debt amounts to the rounding error. That's how large the US number really is.


Payback Time

To get a feeling for the amount: if you spent 100% of your current net income on paying off the debt, how many years would it take you to pay off the US National Debt? For the average US citizen, it would take 525 million working years. If we combined the entire US workforce of roughly 155 million people and spent all their income on relieving the US debt, it would still take an entire Trump presidency to pay off the debt. This is not realistic for a few reasons, of course.

Even if each person in the US workforce spent a modest 10% of their net income on paying off the national debt, it would take the entire country 34 years to pay off the current amount, assuming the US government managed not to grow the debt even larger.
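As a back-of-the-envelope check of those numbers, here is the arithmetic in a few lines of Python. The average net income figure is an assumption, chosen so that the 525 million figure works out; only the debt and workforce numbers come from the text above:

```python
debt = 19_859_586_951_270   # US national debt in dollars, as of writing
income = 37_800             # assumed average net income per worker (illustrative)
workforce = 155_000_000     # approximate size of the US workforce

print(debt / income)                      # ~525 million years for a single worker
print(debt / (income * workforce))        # ~3.4 years for the whole workforce
print(debt / (income * workforce * 0.1))  # ~34 years at 10% of net income
```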

Mind-boggling numbers.


Visualizing Large Amounts of Money

Visualizing the debt in the shape of $100 bills may help. Here is the US National Debt in 2013, compared to an American Football field and Miss Liberty:

US National Debt visualized in stacks of $100 bills (link)

For reference, this is what one billion dollars looks like in stacks of $100 bills:


It is rumored that Pablo Escobar lost $2.3 billion each year to rats eating the bills. That's 2.3X the amount shown above.

Pictures help. Check out demonocracy.info for many other visualizations of large amounts of money.


Short Scale and Long Scale

The large numbers we talked about so far are huge. But, we cannot even agree on what to call really large numbers. Two different scales to talk about large numbers are in use:
  • Long scale. Every new term in the scale is one million times larger than the previous. A billion means a million millions (10^12), a trillion means a million billions (10^18), etc.
  • Short scale. Every new term is one thousand times larger than the previous. A billion means a thousand millions (10^9), trillion means a thousand billions (10^12), etc.
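To make the naming clash concrete, here is a small lookup table in code; the long-scale names follow the traditional European convention:

```python
# (short scale name, long scale name) for each power of ten
scales = {
    10**6:  ("million",     "million"),
    10**9:  ("billion",     "milliard"),
    10**12: ("trillion",    "billion"),
    10**15: ("quadrillion", "billiard"),
    10**18: ("quintillion", "trillion"),
}

for power in sorted(scales):
    short, long_ = scales[power]
    print("10^%-2d  short: %-12s long: %s" % (len(str(power)) - 1, short, long_))
```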
For fun, here is which scale is adopted by different countries across the world, including a couple of exceptions to the rule:

Use of short and long scale across the world (wikipedia)

If you care, a milliardaire and a billionaire are equally rich. Confusing.


Algorithms and Big Numbers

Large numbers confound us. Reasoning about large numbers is even harder. So how can we make the analysis of algorithms more insightful? Applying scales helps. Writing out large numbers in words may help. Visualization definitely helps.

Say we have a certain algorithm that we run on a modest input of 50 elements. The algorithm has implementations with different time complexities. How efficient is each implementation? How well does it scale when we increase the number of elements? Those are questions we care about when scaling to thousands, millions, or even billions of users.

For each class of complexity, the required operations for 50 elements are at the following scale:

Complexity | Operations | Number of Operations in Words/Speech
O(1) | 1 | One
O(log n) | 4 | Four
O(n) | 50 | Fifty
O(n log n) | 195 | One hundred ninety-five
O(n^2) | 2,500 | Two thousand, five hundred
O(2^n) | 1,125,899,906,842,624 | One quadrillion, one hundred twenty-five trillion, eight hundred ninety-nine billion, nine hundred six million, eight hundred forty-two thousand, six hundred twenty-four
O(factorial) | 30,414,093,201,713,378,043,612,608,166,064,768,844,377,641,568,960,512,000,000,000,000 | Thirty vigintillion, four hundred fourteen novemdecillion, ninety-three octodecillion, two hundred one septendecillion, seven hundred thirteen sexdecillion, three hundred seventy-eight quindecillion, forty-three quattuordecillion, six hundred twelve tredecillion, six hundred eight duodecillion, one hundred sixty-six undecillion, sixty-four decillion, seven hundred sixty-eight nonillion, eight hundred forty-four octillion, three hundred seventy-seven septillion, six hundred forty-one sextillion, five hundred sixty-eight quintillion, nine hundred sixty quadrillion, five hundred twelve trillion
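The counts in the table can be reproduced with a few lines of Python; note that the table appears to use the natural logarithm, since 50 * ln(50) is roughly 195:

```python
import math

n = 50
print(math.log(n))        # ~3.9, listed as 4 in the table
print(n * math.log(n))    # ~195.6, listed as 195
print(n ** 2)             # 2500
print(2 ** n)             # 1125899906842624
print(math.factorial(n))  # the 65-digit number in the last row
```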

Those numbers may still not mean much. But just read out the value for O(2^n). Sounds silly, right? The number for O(factorial) is outright laughable. This is what humans do with large numbers: our brain does not know what to do with them, gives up, and turns them into a joke. By the way, who made up those names? Vigintillion? Novemdecillion? Undecillion?

Over their training and/or career, most software engineers have developed an intuitive feeling for complexity theory. They know that in interview questions, proposing an implementation that is O(n^2) means you fell for the trap set by the interviewer. There must be a way to use a hashmap somewhere to get the cost down to O(n) or even O(1).

But does the candidate really understand why O(n^2) is so bad? How do O(n) and O(n^2) relate to each other as the number of elements grows? Let's try a visualization (click the play icon to replay):

In the interactive chart above, the number of elements runs from 1 to 50, plotted from left to right. The vertical scale shows the number of operations, capped at 800. O(log n), O(n), and even O(n log n) stay low for a long time. The big P refers to problems that can be solved in polynomial time, while NP refers to nondeterministic polynomial time.

When interviewing, you definitely do not want to be nondeterministic. With the vertical axis capped at 800, the line for O(n^2) goes off the scale before n even reaches 30. Search for solutions that are O(n), as is the case in linear search, or O(log n), the number of steps needed in binary search, or ideally O(1), when using a hashtable.
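The three lookup flavors mentioned above can be put side by side in a short sketch; the data set and target are arbitrary:

```python
from bisect import bisect_left

data = list(range(0, 1000, 3))  # sorted list: 0, 3, 6, ..., 999
target = 999

# O(n): linear search scans the list element by element
linear = target in data

# O(log n): binary search halves the search space at each step
i = bisect_left(data, target)
binary = i < len(data) and data[i] == target

# O(1): hashtable lookup, after a one-time O(n) set construction
table = set(data)
constant = target in table

print(linear, binary, constant)  # True True True
```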


Links

  • A large number of live tickers, including the current US National Debt, can be found at usdebtclock.org
  • Pronouncing numbers as words makes for a nice programming interview question, but you can also go to Calculator Soup
  • Check out demonocracy.info for many other visualizations of large amounts of money.
  • For more details on program analysis, check out time complexity.
  • The O(x) visualization can be found amid numerous algorithm visualizations at PyAlgoViz.


Monday, December 19, 2016

Auger - Automatic Unit Test Generation for Python

Unit tests are crucial to software development. They verify whether a given component actually implements what it promises. They are also important for long-living code, where future maintenance can be much more expensive without the proper level of unit test coverage. Not surprisingly, the lack of unit tests is a show-stopper for submitting production code at many companies.


You may wonder, if unit tests are so great, why do engineers hate to write them? More than once have I heard fellow engineers say "If you approve my code now, I will write the unit tests in a future change list." All engineers I know actually like writing code. It is creative. Code means impact. However, in general, engineers do not like writing unit tests at all. Why is that?

Why is Unit Testing Hard?

Unit testing is hard for various reasons:
  1. One reason would be that a unit testing framework is exactly that: a framework. When I worked on the IBM J9 and Eclipse team with Dave Thomas, he used to say "Everybody likes to write frameworks, but nobody likes to actually use them". In practice, unit test frameworks add yet another layer of complexity to learn and master, especially as unit test frameworks have little in common across programming languages or organizations.
  2. All unit testing frameworks start off modestly. Erich Gamma once confided in me how he and Kent Beck wrote the original version of JUnit when they got bored on a transcontinental flight. He added how those few hours of work had been by far the best investment of his technical career ever. Today, however, unit testing frameworks are far from trivial and require a considerable learning investment in understanding their power and intricacies.
  3. Software systems themselves are also becoming increasingly complex. Even seemingly standalone components are heavily embedded in a context of complex runtimes. To isolate those dependencies and sculpt them with the proper "mocks" is an art. It is not uncommon to spend hours trying to find out how to write the mocks for one line of test code. Dependency injection and fancy mock syntax conspire to make the life of unit test authors challenging.
  4. Code under development constantly changes. Often, coding can be a discovery game. A design doc can outline architectural decisions on what database to use and how the UI is created. However, it tends to leave the actual coding as an exercise to the reader. During coding, discoveries are made. Code is constantly refactored. Code is found obsolete and is deleted even before it is ever committed to version control. This is particularly the case when a new domain or technology is being explored. We still write unit tests, but we write them later, when the dust settles.


OK. Unit tests are hard. But, what is the alternative?

Let's agree, unit test are a chore and tough to write. This does not mean engineers are not testing their code, even if they do not write unit tests. Many engineers, like myself, write code in a highly iterative fashion. For instance, say I am writing an AppEngine app to return a web page. I would first add a route, write a Handler, and simply return "Hello World". To test, I would spin up a local instance, point my browser at localhost:8080 and see if it shows the string I expected. I iterate that step hundreds of times.

In this iterative development mode, small, incremental steps are made towards the end goal and progress is constantly validated. Each time, some more functionality is added and tested on the code in progress. At some point, we are going to be "done". At this final point in time, hundreds or thousands of "test" runs have been exercised on the code under development. Each component has had some inputs and produced some expected outputs. If only we could remember what those inputs and outputs were, our tests would be so much easier to write. Cue: Project Auger.

What is Project Auger?

Project Auger (Automated Unittest Generator) watches your Python code while you write it and automatically generates all unit tests for your code, including all the mocks. Little or no work is required by the developer.



How does Auger Work?

Auger works like a smart Python debugger that sets breakpoints for each component you are interested in. Auger tracks two kinds of function calls related to the module under test:

  • For each function defined in the module, Auger records both the values of the arguments and the returned results. After recording enough execution traces, unit tests can be generated with the meaningful placeholder argument values and assertions. 
  • For each call made from a given component to dependent libraries or other components, we record the return value, so that this call can be automatically mocked out with known return values.

Auger tracks all possible functions, including instance methods, class methods, and static functions.

Consider the following example, pet.py, that provides a Pet with a name, age, and a species:

from sample.animal import Animal


class Pet(Animal):
    def __init__(self, name, *args):
        Animal.__init__(self, *args)
        self._name = name

    def get_name(self):
        return self._name

    @staticmethod
    def lower(s):
        return s.lower()

    def __str__(self):
        return '%s is a %s aged %d' % (
            self.get_name(), 
            Pet.lower(self.get_species()), self.get_age()
        )


def create_pet(name, species, age=0):
    return Pet(name, species, age)


if __name__ == '__main__':
    print(Pet('Polly', 'Parrot'))
    print(create_pet('Clifford', 'Dog', 32))

This class has a few different entry points we would need to unit test:
  • The class Pet itself which has:
    • a static method, lower
    • two instance methods, get_name and __str__
    • a constructor, __init__, which is really a very special instance method
  • A module-level function, create_pet, that creates a Pet and returns it
The Pet class is a subclass of Animal, which we know nothing about, so we will need to mock that entire class. We do know that the class is used in the Pet constructor and that inherited methods are called on Pet as well. This means that the implementations of self.get_species() and self.get_age() are unknown, as we cannot look at the implementation of Animal when unit testing Pet. Therefore, those two inherited methods will be mocked out.

Unit Tests Generation with Auger

The above class definition combined with an execution run is enough for Auger to automatically create the following fully functional unit test:

from mock import patch
from sample.animal import Animal
import sample.pet
from sample.pet import Pet
import unittest


class PetTest(unittest.TestCase):
    @patch.object(Animal, 'get_species')
    @patch.object(Animal, 'get_age')
    def test___str__(self, mock_get_age, mock_get_species):
        mock_get_age.return_value = 12
        mock_get_species.return_value = 'Dog'
        pet_instance = Pet('Clifford', 'Dog', 12)
        self.assertEquals(pet_instance.__str__(), 'Clifford is a dog aged 12')

    def test_create_pet(self):
        self.assertIsInstance(sample.pet.create_pet(age=12,species='Dog',name='Clifford'), Pet)

    def test_get_name(self):
        pet_instance = Pet('Clifford', 'Dog', 12)
        self.assertEquals(pet_instance.get_name(), 'Clifford')

    def test_lower(self):
        self.assertEquals(Pet.lower(s='Dog'), 'dog')

if __name__ == "__main__":
    unittest.main()

No changes to the original code are needed to teach Auger anything. All that is required is for the developer to write their original code and exercise it somehow. In the above case, we simply ran python -m sample.pet to produce two scenarios in which Pet instances were created and manipulated. From those two samples, a single test was extracted.

Of course, Auger is limited in the sense that it cannot guess what scenario is being tested. Rather than generating multiple, focused tests per module, it generates one big test that covers the entire module. The value of Auger lies more in generating all the boilerplate code, imports, and mocks, ensuring proper coverage, and producing a template for manual refinement.

To generate a set of unit tests, Auger magic is invoked:

import auger
if __name__ == "__main__":
    with auger.magic(pet):
        main() 

In this case, one module is passed, pet, but multiple modules can be passed as well. Each one will be traced and unit tested.

When a unit test is produced, it is written out to the local file system under the corresponding tests folder.


IDE Integration

Auger does not have direct IDE integration per se, but works really well with PyCharm. This integration comes for free, because the IDE watches the underlying file system and will automatically discover when new files are created by Auger in the local file system. These tests can then be executed easily as well:


Adding the generated tests to Git and committing/pushing them to a repository takes just a few clicks from that point onwards in an IDE such as PyCharm.

Future Work

  • Incremental test generation. Collect multiple execution runs, persist the invocations, and merge multiple runs into one test case.
  • Preserve manual edits performed by users on generated test cases when a test is regenerated.
  • Support other unit test frameworks, such as pytest, nose, cucumber, etc.
  • Figure out how to run Auger on itself. This is non-trivial :-)



Check out Project Auger and let me know what you think of it. Pull requests are welcomed.




Python code styled as tango by hilite.me with border:1px solid #ddd;padding:11px 7px;

Sunday, December 18, 2016

QuickSort

QuickSort is a divide-and-conquer algorithm that partitions an array around a pivot value and recursively sorts the resulting parts. In other words, to sort an array, we pick a pivot, split the array into three sections, holding the values smaller than, equal to, and larger than the pivot, and recursively sort the smaller and larger sections:

def qsort(array):
    if len(array) < 2: return array
    pivot = array[0]
    less = [n for n in array if n < pivot]
    equal = [n for n in array if n == pivot]
    larger = [n for n in array if n > pivot]
    return qsort(less) + equal + qsort(larger)

In the original implementation of QuickSort, the first element in the array was picked as pivot, as is done on line 3 above. Below, the red bar indicates the choice of pivot. At each step of the algorithm, it is at the left of the array being sorted:


Choosing the first element as pivot can generate extra work when the array is already (partially) sorted. Rather than picking the first element, the middle element can be chosen, behaving better on already sorted arrays:

def qsort(array):
    if len(array) < 2: return array
    pivot = array[len(array) // 2]
    less = [n for n in array if n < pivot]
    equal = [n for n in array if n == pivot]
    larger = [n for n in array if n > pivot]
    return qsort(less) + equal + qsort(larger)

Choosing the value in the middle of the array as pivot yields the following result, with the red bar again indicating the pivot:




Rather than pick the middle element as pivot, a random value can be picked from the array:

import random

def qsort(array):
    if len(array) < 2: return array
    pivot = array[random.randint(0, len(array) - 1)]
    less = [n for n in array if n < pivot]
    equal = [n for n in array if n == pivot]
    larger = [n for n in array if n > pivot]
    return qsort(less) + equal + qsort(larger)

The pivot is now picked randomly as shown in the visualization below:


The final approach to pivot selection we will look at was invented by Robert Sedgewick and involves picking the median of three values:


def qsort(array):
    if len(array) < 2: return array
    # pick the median of the first, middle, and last value
    pivot = sorted([array[0], array[len(array) // 2], array[-1]])[1]
    less = [n for n in array if n < pivot]
    equal = [n for n in array if n == pivot]
    larger = [n for n in array if n > pivot]
    return qsort(less) + equal + qsort(larger)

The pivot is now picked by looking at three different values as shown in the visualization below:


Aside from how we pick the pivot, all versions of QuickSort shown above have one thing in common: they use recursion to implement the divide-and-conquer tactic. Recursion has overhead, in that a new stack frame needs to be constructed for each call. Recursion also limits the maximum size of the array we can sort, but that is more of a theoretical challenge.

QuickSort can also be implemented without recursion, by using a user-defined stack to keep track of the work to be done. A possible implementation:

def qsort(array):
    work = [array]
    result = []
    while work:
        array = work.pop(0)
        if len(array) < 2:
            result.extend(array)
        else:
            pivot = array[0]
            less = [n for n in array if n < pivot]
            equal = [n for n in array if n == pivot]
            larger = [n for n in array if n > pivot]
            if not less and not larger:
                result.extend(equal)  # all remaining elements equal the pivot
            else:
                work = [less, equal, larger] + work
    return result

And the execution (this time using the first element as pivot again):


In general, QuickSort is not a stable sort: elements of equal value are not guaranteed to remain in their original order in most implementations. However, the implementations above keep the elements that are equal to the pivot in the order in which they were found. That makes the above implementations stable.


Check out more algorithm visualizations at PyAlgoViz.





Friday, December 16, 2016

MergeSort

MergeSort is a divide-and-conquer algorithm that recursively sorts two halves of an array and merges them to form a fully sorted end result. In other words, to sort a sequence, we find the middle point, sort the first half, sort the second half, and finally merge both halves:

def mergeSort(array, start, end):
    if end - start > 1:
        middle = (start + end) // 2
        mergeSort(array, start, middle)
        mergeSort(array, middle, end)
        merge(array, start, middle, middle, end)

The merging of the two halves is done by inserting elements of the second half into the first half. The second half keeps shrinking in size until we are done:

def merge(array, left, leftEnd, right, rightEnd):
    while left<leftEnd and right<rightEnd:
        if array[left] > array[right]:
            array.insert(left, array.pop(right))
            right += 1
            leftEnd += 1
        else:
            left += 1
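Putting the two functions together, a small driver sorts a list of random numbers in place; the functions are repeated here so the sketch runs standalone:

```python
import random

def mergeSort(array, start, end):
    if end - start > 1:
        middle = (start + end) // 2
        mergeSort(array, start, middle)
        mergeSort(array, middle, end)
        merge(array, start, middle, middle, end)

def merge(array, left, leftEnd, right, rightEnd):
    # move elements of the second half into place in the first half
    while left < leftEnd and right < rightEnd:
        if array[left] > array[right]:
            array.insert(left, array.pop(right))
            right += 1
            leftEnd += 1
        else:
            left += 1

numbers = [random.randint(0, 100) for _ in range(40)]
mergeSort(numbers, 0, len(numbers))
print(numbers == sorted(numbers))  # True
```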

Executing the above implementation on 40 random numbers produces the following result:




This implementation is a stable sort. Elements of equal value remain in their original sort order.


Check out more algorithm visualizations at PyAlgoViz.




Thursday, December 15, 2016

Generating Prime Numbers

In this blog post, we will generate prime numbers and explore brute force, memoization, and dynamic programming.

Prime numbers are numbers greater than one that are only divisible by themselves and one. In other words, the following Python function returns True if n is greater than one and none of the numbers between 2 and n is a factor of n:
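The function itself was shown as an image in the original post; it presumably looked like this naive sketch:

```python
def isPrime(n):
    if n < 2:
        return False
    for factor in range(2, n):  # try every possible factor below n
        if n % factor == 0:
            return False
    return True

print([n for n in range(20) if isPrime(n)])  # [2, 3, 5, 7, 11, 13, 17, 19]
```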


Of course, this implementation is rather naive and takes a lot longer than is necessary:


Our first optimization would be to realize that we need not check all factors in between 2 and n. Namely, we only need to look at the factors between 2 and n/2, as any number higher than that could never be a factor of n.

Our code now looks like this:
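The listing was an image in the original post; a sketch of what it showed, with the loop now stopping at n/2:

```python
def isPrime(n):
    if n < 2:
        return False
    for factor in range(2, n // 2 + 1):  # <== doing half the work
        if n % factor == 0:
            return False
    return True
```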



(The code listing here was an image, with the changed line annotated "<== doing half the work".)
With this optimization in place, the algorithm runs about twice as fast already:


Still, we can do better. We can actually already stop considering factors when we reach sqrt(n):
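Again, the listing was an image; the sqrt(n) version presumably looked like this sketch:

```python
import math

def isPrime(n):
    if n < 2:
        return False
    for factor in range(2, int(math.sqrt(n)) + 1):  # stop at sqrt(n)
        if n % factor == 0:
            return False
    return True
```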










A further optimization would be to realize that we only need to consider the factors 2, 3, 5, 7, 9, etc., as 4, 6, 8, etc. could never be the smallest factor, since they are themselves divisible by 2. But we'll leave that refinement to the reader.

Let's try our sqrt(n) optimization in the next iteration of our algorithm:


By now, our algorithm should be more than twice as fast.

Of course, we can still do better. For instance, say we are checking whether 41 is a prime number. To validate 41, our algorithm will try the factors between 2 and sqrt(41), which rounds down to 6:
  • 41 % 2 = 1, not a factor
  • 41 % 3 = 2, not a factor
  • 41 % 4 = 1, not a factor
  • 41 % 5 = 1, not a factor
  • 41 % 6 = 5, not a factor
When we check for factors 4 and 6, we are doing double work. Namely, if 41 were divisible by 4, it would already be divisible by 2. The same argument goes for 6, which is divisible by 2 and 3 itself.

In other words, to check if a given number is prime, we need to only verify the prime numbers between 2 and sqrt(n) as divisible factors. This greatly reduces the search space. All we need to do is remember the current list of prime numbers and use those as factors. This approach is generally referred to as a Prime Sieve as opposed to the trial division approach we used so far:
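The sieve listing was an image in the original post; a sketch of what the memoized version might have looked like follows. The names and structure are assumptions, but note the linear `n in primes` membership check on a list, which the remark below refers to:

```python
primes = [2]  # cache of all primes discovered so far, in increasing order

def isPrime(n):
    if n in primes:               # linear search, since primes is a list
        return True
    if n < 2 or n <= primes[-1]:  # already sieved past n, and n was not prime
        return False
    # discover and cache every prime up to n, using only primes as trial factors
    for candidate in range(primes[-1] + 1, n + 1):
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
    return primes[-1] == n
```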




















Note that the first line of our isPrime function now performs a linear search, because we used a list to hold the current primes. Ideally we would use an ordered set instead of a list, to make membership checks O(1) and speed up that part of the algorithm even more. For simplicity, we use a list for now.

Using our newly minted insight on using primes as factors, we produce the following execution:


That is confusing. This is slower. How can that be?

One reason is that at this point, our code is a lot more complex than before. Namely, in order to answer the question if a given number is a prime, in the worst case, we need to discover all the primes before that number. There is also some bookkeeping overhead to caching each of the primes we found so far. We don't get much apparent speedup due to all the extra work we need to do.

However, once we actually warmed up the cache, by running our loop once ahead of time, and then measuring a second iteration of our loop, we see an interesting speedup again:


Essentially, we are demonstrating the power of memoization here. The progression we have seen so far is from brute force, to a bit smarter, to using another smart insight, to remembering intermediate steps. By remembering the intermediate results, we essentially are performing a space-time tradeoff. By adding a bit more memory, we can avoid performing repetitive operations.

The next big leap in an algorithm such as finding prime numbers is whether we can make any assumptions on the order in which the API is accessed. For instance, are we more interested in being able to figure out if any given random number is a prime number, or are we more interested in producing a prime number generator, one that produces increasingly larger primes? If the latter is the case, we enter the domain of Dynamic Programming.

By assuming we produce increasingly larger primes, the Pythonic instrument we love to go for is a generator function. It can keep local state, return one result at a time, and resume execution at the place where it yielded the most recent result:
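The generator listing was also an image; a sketch matching the description could be:

```python
from itertools import islice

def primes():
    found = []  # local state: all primes yielded so far
    n = 2
    while True:
        if all(n % p for p in found if p * p <= n):
            found.append(n)
            yield n  # execution resumes here on the next request
        n += 1

# It can be used like any other iterable:
print(list(islice(primes(), 10)))  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```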
















The use of co-routines in this manner allows us to combine the concept of memoization and cleanly wrap it in a nice abstraction. The function can be used like any regular Python function as follows:






The final iteration of our algorithm now produces:


This demonstrates a nice progression from brute force to memoization to DP.

That said, if you made it this far, you probably want to take a look at the Sieve of Atkin, which generates primes using an O(n) approach.

For more algorithms and their visualizations, check out PyAlgoViz.

Tuesday, December 13, 2016

What I Learned While Fighting Fake News

In a couple of upcoming blog entries I will share what I learned while building Realnews/Fakenews, a Chrome extension that adds visual indicators to sites such as Facebook and Twitter to warn readers of possible Fake News links:


npr.org: Fake News

I will talk about the extension itself, the architecture and implementation of the server, choices in hosting plans and pricing for the server, and how to use machine learning to predict fake news from just looking at the URL. 

Fake News is serious. It can lead to vigilantes storming pizza parlors and foreign parties influencing US elections or upcoming UK elections. It is hard to root out fake news, as it feeds on a powerful psychological trait called confirmation bias. That means fake news is often designed to cater to a specific target audience. Fake news authors write the specific content their targets like to hear, whether it is factual or not. The goal is to increase the likelihood of readers sharing it, thereby strengthening the original cause or, more likely, generating advertisement income.

What can we do about fake news? In a simple diagram, this is how fake news spreads:



There is not much we can do to stop fake news publishers from actually creating content. What the industry can do, social media outlets in particular, is to stop giving fake news publishers a free distribution channel. That's a good step in the right direction.

Where the Realnews/Fakenews Chrome extension helps is at the reading stage. When a link is presented on Facebook or Twitter to a third party site as a shared post, a small button is added to the top right to indicate the category of the site:


In this case, the site this tweet links to is marked as real news. Of course, the subject itself is fake news, but that's not the point. When the green button is pressed, a category selector shows up as follows:



When a different category is chosen by the reader, it counts as one vote. A consensus model is used on the server so that future readers will be presented with the popular vote. Voting is non-moderated. 

Once enough pages have been voted on, features can be extracted such as the hosting domain of the link, words used in the URL, or the occurrence of specific content, such as lots of ads. Machine learning is then used to classify unknown sites and seed the prediction with a meaningful starting point. 

Like I said at the start, more details on the implementation will come in future blog entries. For now, try it out!

Monday, December 5, 2016

PyAlgoViz on Github


Long overdue, I finally published my PyAlgoViz source code on GitHub. The Python source shows you how to create a sandbox on AppEngine to run Python code, debug the code step by step, run visualization logic at each step, record the visualization, and return the result to the browser to render it there using HTML5 Canvas.


















You may also like the accompanying video and slide deck linked from chrislaffra.com.

Happy visualizing!