The Flying Dutchman

Simple serverless services using Google AppEngine

2017-01-19T18:32:00.001-08:00

Distributed programming is hard. Data needs to be prepared, split up, executed in parallel, and combined into one solution again. Coordination, cluster management, fault-tolerance, and optimization make parallel processing complex. Things are simplified when the problem is embarrassingly parallel. This is the case when sections of the data can be processed without any dependency on other parts. The ideal is a function that has no dependencies on its execution context and is side-effect free. Such a function can be easily memoized, or executed in any order, depending on a specific execution plan.

The aim of distributed programming is to increase scalability when a given set of operations cannot be scaled anymore on one server just by adding more memory or CPUs. The solution is to split up the work and send it to multiple machines to process. This allows us to scale to the processing of terabytes or petabytes of data using a cluster of thousands of nodes. That brings us to the latency scale for computing, inspired by the earlier list originally collected by Peter Norvig:

The orders of magnitude for latency shown in the table above should be considered when distributing data over a cluster. Work should only be sent to another node when the cost of transferring the data is lower than the cost of computing it locally.

Serverless Functions

Serverless functions allow developers to describe logic that executes in response to a given event, in a stateless container that can run one or more functions at the same time. Serverless functions do not need to be executed in parallel, but they are easily parallelizable due to their context-free and side-effect-free behavior.

What do serverless functions look like and how can they be implemented? Let’s look at an example. Say we have a Google AppEngine application written in Python with the following lambda-like function that takes a number of zipcodes and looks up their addresses:

def geolocate(zipcodes):
    return [get_address(zipcode) for zipcode in zipcodes]

Now assume that the get_address function itself is implemented as a service call to an external geolocation service, such as the Google Maps API. We now run into the bottleneck of network speed. Depending on the distance between the client and the server, each call realistically costs ~10ms if server and client are in the same region and ~75ms between NYC and Seattle. Crossing the Atlantic Ocean takes an additional 100ms.

If we have to resolve 1,000 addresses, the above simple code takes 10s at best and 3 minutes at worst. Typically, APIs are aware of this issue and allow for bundling of requests into one. If such bundling is not available, the solution would be to run multiple requests in parallel, either by using multiple threads or separate processes.
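The thread-based option mentioned above can be sketched as follows. This is a minimal illustration written for Python 3, not part of the application described in this post: get_address here is a hypothetical stand-in that sleeps for 10ms to simulate network latency, and the worker count of 20 is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def get_address(zipcode):
    # Hypothetical stand-in for a remote geolocation call.
    # The sleep simulates ~10ms of network latency per request.
    time.sleep(0.01)
    return 'address for %s' % zipcode


def geolocate_threaded(zipcodes, max_workers=20):
    # Threads spend most of their time waiting on the network,
    # so many requests can be in flight at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the order of the input.
        return list(executor.map(get_address, zipcodes))
```

With 20 workers, the waiting time for the simulated calls overlaps, so the total duration shrinks roughly in proportion to the number of threads.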

The above example places the bottleneck in the network. However, for different calculations, the bottleneck might be CPU-heavy and require load balancing to a number of servers in a cluster. Managing such a cluster requires extra planning. It would be nice if we could not worry about how the calls are made in parallel, and simply use a decorator as follows:

@serverless.parallel
def geolocate(zipcodes):
    # run in parallel on a cluster
    return [get_address(zipcode) for zipcode in zipcodes]

The decorator splits the data into smaller chunks, sends each chunk to a different server, collates the results as they arrive in parallel, and finally returns the combined result. To the reader, this code looks like it runs serially in one thread, but in reality it is highly parallelized.
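To make the split, dispatch, and collate idea concrete, here is a local sketch in Python 3 that uses threads instead of servers. The decorator name parallel, its worker_count parameter, and the square_all example are assumptions made for illustration only; the actual library dispatches chunks over HTTP, as described further below.

```python
import functools
from concurrent.futures import ThreadPoolExecutor


def parallel(worker_count=4):
    """Hypothetical local stand-in for a decorator that runs chunks in parallel."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(data):
            data = list(data)
            if not data:
                return []
            # Ceiling division, so we never create more chunks than workers.
            size = -(-len(data) // worker_count)
            chunks = [data[n:n + size] for n in range(0, len(data), size)]
            with ThreadPoolExecutor(max_workers=len(chunks)) as executor:
                # Each chunk is handled by the original function;
                # map keeps the chunk results in input order.
                results = executor.map(func, chunks)
                # Collate the per-chunk lists into one flat list.
                return [item for chunk in results for item in chunk]
        return wrapper
    return decorator


@parallel(worker_count=3)
def square_all(numbers):
    return [n * n for n in numbers]
```

The decorated function is still called with the full list, which is what makes the calling code read as if it were serial.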

Before we get to how this is implemented, here is a typical run of the hosted application at simple-serverless.appspot.com:

Number of zipcodes = 3000
Regular duration = 345.57s
Parallel duration = 18.98s
Speedup = 18.2X
Details per pipeline step:
1 - Step "geolocate" took 18.65s for 100 workers and 3000 elements
2 - Step "cleanup" took 0.34s for 100 workers and 3000 elements
3 - Step "sort" took 0.00s for 1 worker and 413 elements

In the above run, we resolved 3,000 zipcodes in ~19 seconds, while normal execution would take almost 6 minutes. Actual speedup depends on how many nodes are currently warmed up. Typically, a second run runs faster.

Implementation

To implement the above decorator, the easy part is splitting the data into smaller chunks and collating the results after all the work is done. The hard part is finding servers to run the stateless function on and using an effective load balancer to scale horizontally to meet elastic demand. It turns out that Google AppEngine was designed to do exactly that.

If we can somehow handle a chunk by making a web service call back to our own domain, using GAE load balancing to dispatch back to the current server, or wake a new one depending on current load, we effectively piggyback on GAE to create a poor-man's serverless lambda implementation. It turns out that is not that hard to do.

Breaking up the original data into multiple chunks looks like this:

bucketSize = max(1, int(len(data) / WORKER_COUNT))
buckets = [
    data[n: n + bucketSize]
    for n in xrange(0, len(data), bucketSize)
]

We then convert the chunk into JSON, encode what lambda function we want to run, and invoke it as an RPC call:

# create a non-blocking, asynchronous worker to handle one 
# bucket, on our own instance, or on new ones automatically
# launched by appengine
def createWorker(bucket):
    worker = urlfetch.create_rpc(deadline=60)
    scheme = os.environ['wsgi.url_scheme']
    host = os.environ['HTTP_HOST']
    url = scheme + '://' + host + URL
    headers = {
        'Content-Type': 'application/x-www-form-urlencoded',
        SECRET_KEY: SECRET,
    }
    payload = urllib.urlencode({
        'module': method.__module__,
        'className': className,
        'isclass': isclass,
        'method': method.__name__,
        'data': json.dumps(bucket),
        'args': json.dumps(args),
        'argv': json.dumps(argv),
    })
    return urlfetch.make_fetch_call(worker, url, payload, 
         'POST', headers)

Note: We use a secret that is known only to the application, to avoid external calls to our worker functions. This security by obscurity is not recommended for production applications and a stronger authentication model should be used then.

The receiving worker route is set up in the client code as follows:

app = webapp2.WSGIApplication([
    ('/', ZipCodeHandler),
    serverless.init('/serverless_route', SERVERLESS_SECRET),
])

We initialize serverless with a route name of our choice, which can be any arbitrary name and the secret of our choosing. When a chunk of data is sent to that route as a POST, it is unpacked, the requested lambda function is invoked, and the result is returned in JSON format.
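The unpack, invoke, and return steps on the receiving side could look roughly like the sketch below, written for Python 3. The function name handle_worker_request and its parameters are hypothetical and only mirror the 'module', 'method', and 'data' fields of the payload shown earlier; class methods and the extra 'args' and 'argv' fields are left out.

```python
import hmac
import importlib
import json


def handle_worker_request(fields, header_secret, expected_secret):
    """Hypothetical handler: decode one chunk, run the function, encode the result."""
    # Reject calls that do not carry the shared secret.
    if not hmac.compare_digest(header_secret, expected_secret):
        raise PermissionError('invalid secret')
    # Look up the requested function by module and name.
    module = importlib.import_module(fields['module'])
    function = getattr(module, fields['method'])
    # Unpack the chunk, invoke the function, and return JSON.
    bucket = json.loads(fields['data'])
    return json.dumps(function(bucket))
```

For example, a request with module 'builtins', method 'sorted', and data '[3, 1, 2]' would run sorted on the decoded list and send the sorted list back as JSON.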

The results for each of the workers are collated as follows:

workers = map(createWorker, buckets)
result = list(itertools.chain(*map(getResult, workers)))

The workers are created and will invoke their lambda functions asynchronously using the createWorker function shown above. The result is collected in the getResult handler, which blocks until the result is received:

def getResult(worker):
    response = worker.get_result().content
    try:
        return json.loads(response)
    except Exception as e:
        logging.error('Error: %s' % e)
        return [e]

An additional utility is offered to handle workflow processing in the form of a pipeline:

pipeline = serverless.Pipeline(geolocate, cleanup, sort)

This sets up a serverless pipeline where data streams from geolocate, to cleanup, to sort. The pipeline is invoked using its run method:

zipcodes = [
    random.randint(10000,99999)
    for n in xrange(ZIPCODE_COUNT)
]
addresses = pipeline.run(zipcodes)
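Conceptually, a pipeline is a chain of functions where each step consumes the output of the previous one. The class below is a simplified Python 3 sketch of that idea, not the library's implementation: it runs every step locally and records a timing per step, similar to the per-step details in the sample run above. The timings attribute is an assumption made for illustration.

```python
import time


class Pipeline(object):
    """Hypothetical, purely local pipeline: each step feeds the next."""

    def __init__(self, *steps):
        self.steps = steps
        self.timings = []

    def run(self, data):
        self.timings = []
        for step in self.steps:
            start = time.perf_counter()
            data = step(data)
            duration = time.perf_counter() - start
            # Record (step name, seconds, number of elements produced).
            self.timings.append((step.__name__, duration, len(data)))
        return data


def cleanup(addresses):
    # Drop empty results, such as zipcodes that could not be resolved.
    return [address for address in addresses if address]


def sort(addresses):
    return sorted(addresses)


pipeline = Pipeline(cleanup, sort)
addresses = pipeline.run(['Seattle, WA', None, 'New York, NY', ''])
```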

The cleanup and sort functions look similar to the geolocate function. Key is that they are fully "functional". They depend only on the values of the data they process and are completely context free. The cleanup function is executed in parallel:

@serverless.parallel
def cleanup(addresses):
    return filter(None, addresses)

The sort function is executed sequentially, in the current thread, on all the collated results from the previous step in the pipeline:

@serverless.sequential
def sort(addresses):
    return sorted(addresses)

The entire implementation of this simple serverless library is just 230 lines of Python, slightly more than the length of this README file.

Check the github repo with the source for simple-serverless.

Disclaimer: This work is a personal weekend project and is unrelated to Google Cloud Functions, a serverless functions solution based on Node.js, and Google Dataflow for Python, a more comprehensive solution for data-driven distributed programming pipelines.

The 3Sum Problem

2017-01-06T07:37:00.002-08:00

For a given list of numbers, the 3Sum problem discovers whether there is at least one combination of three numbers in the list that sum up to zero. The same problem is also listed at LeetCode, with a small twist, where all solutions are to be returned, with duplicates removed.

A few examples:

[0, 0, 0] ==> [[0, 0, 0]]
[-1, 0, 1, 2, -1, -4] ==> [[-1, -1, 2], [-1, 0, 1]]
[-2, 0, -2, 1, 0, 4, 0, -1, -2, 0, -2, 1, 0, 4, 0, -1] ==> [[-1, 0, 1], [-2, -2, 4], [0, 0, 0], [-2, 1, 1]]

The naive, brute force, yet Pythonic approach would be to use itertools.combinations to generate a collection of all possible triplets. From those triplets, we filter out the ones that sum up to zero. Each triplet we then order and add to a set to remove the duplicates. We need to temporarily convert the triples to a tuple, as lists are not hashable in Python. Finally, we convert the set of unique results into a list of lists, as requested by Leetcode:

def threesum_brute_force(L):
    triples = itertools.combinations(L, 3)  # O(n^3)

    zeroes = [t for t in triples if sum(t) == 0]
    return map(list, set([tuple(sorted(t)) for t in zeroes]))

That approach is O(n^3) because we compare all possible combinations. It may look like we actually create them ahead of time before doing any further analysis, but itertools.combinations is implemented as a generator function, so we only produce one triplet at a time, when needed. For a list with 800 elements, we end up generating and comparing 510,081,600 triples.

A similar implementation uses three loops to create each of the combinations. This is still O(n^3), more clearly showing now as we iterate over i, j, and k. This version runs much faster than the previous one, as we avoid creating all the intermediate tuples themselves. We are still doing a lot of repetitive work.

def threesum_triple_loop(L):
    result = set()
    for i in xrange(len(L) - 2):
        for j in xrange(i + 1, len(L) - 1):
            for k in xrange(j + 1, len(L)):  # O(n^3)
                if L[i] + L[j] + L[k] == 0:
                    result.add(tuple(sorted((L[i], L[j], L[k]))))
    return map(list, result)

Rather than comparing all triples, producing O(n^3) time complexity, there is a way to solve this problem in O(n^2). The insight is to first sort all the numbers. Then we loop from left to right. At each incremental step the next number will be larger than before. Rather than have two nested loops to find the next two elements of the triple, we maintain a region bound by j and k.

While we go over each number, we sum this number with the two numbers on the edge of the region. If the sum is greater than zero, we are too far to the right and we shrink the region towards the left. If the sum is less than zero, we are too far to the left and we shrink the region towards the right. The region remembers the last result and provides a good starting point to find the next zero sum triple. This reuse of earlier results is why this post labels the solution Dynamic Programming (DP), rather than brute force; the technique is more widely known as the two-pointer approach.

A final optimization is to stop once the number at i reaches zero. Beyond that point all three numbers are positive, so their sum can only be larger than zero.

def threesum(L):
    L = sorted(L)
    result = set()
    for i in xrange(len(L) - 2):
        j = i + 1
        k = len(L) - 1
        while j<k:
            s = L[i] + L[j] + L[k]
            if s == 0:
                result.add(tuple(sorted((L[i], L[j], L[k]))))
            if s > 0:
                k -= 1
            else:
                j += 1
        if L[i] >= 0:
            break
    return map(list, result)
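One way to gain confidence in the faster solution is to compare it against brute force on random inputs. The sketch below is a Python 3 rendition written for this check; the function names and the choice of returning sets of tuples are assumptions made to keep the comparison simple.

```python
import itertools
import random


def threesum_brute_force(numbers):
    # Reference answer: try every combination of three numbers.
    return {tuple(sorted(t))
            for t in itertools.combinations(numbers, 3) if sum(t) == 0}


def threesum_two_pointers(numbers):
    numbers = sorted(numbers)
    result = set()
    for i in range(len(numbers) - 2):
        if numbers[i] > 0:
            break  # three positive numbers can never sum to zero
        j, k = i + 1, len(numbers) - 1
        while j < k:
            s = numbers[i] + numbers[j] + numbers[k]
            if s == 0:
                # i < j < k in a sorted list, so the tuple is already ordered.
                result.add((numbers[i], numbers[j], numbers[k]))
                j += 1
            elif s > 0:
                k -= 1
            else:
                j += 1
    return result


random.seed(7)
for _ in range(50):
    sample = [random.randint(-10, 10) for _ in range(random.randint(0, 25))]
    assert threesum_two_pointers(sample) == threesum_brute_force(sample)
```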

Using DP, we narrowed down the search space dramatically. However, in the above solution, each time we increment i, we pick the next k by selecting the last number in the list, which is also the maximum number. Now, once i gets closer and closer to zero, this k will be less appropriate and we will end up doing a linear search from k towards i to get a smaller positive number that will give us a zero sum. The optimal k can be found more efficiently with a binary search, which makes the overall algorithm more efficient again.

A similar argument applies to j. If we consider the maximum number, it may make less sense to make j become i+1 for its first candidate. Namely if i+j+k==0, then j should be -i + -k. For example, assume the number at i is -30 and the maximum for the list is 24. In that case, rather than make j point at a number such as -29, we can make j skip all the way ahead to 6, as that would be the first candidate to yield zero. Again, using binary search will give us a starting point for j more effectively than doing a linear search, if we are dealing with a large number of elements:


def threesum_binarysearch(L):
    L = sorted(L)
    n = len(L)
    result = set()
    i = 0
    while i < n - 2:
        j = binarySearch(L, i + 1, n - 2, -(L[i] + L[-1]))
        k = binarySearch(L, j + 1, n - 1, -(L[i] + L[j]))
        while j<k and k<n:
            s = L[i] + L[j] + L[k]
            if s == 0:
                result.add(tuple(sorted((L[i], L[j], L[k]))))
                k -= 1
                j += 1
            elif s > 0:
                k -= 1
            else:
                j += 1
        if L[i] == 0:
            break
        i += 1
    return map(list, result)

The binary search makes the search for j and k possible in O(log n), rather than O(n):

def binarySearch(L, min, max, target):
    while True:
        if max < min:
            return min
        mid = (min + max) // 2
        if L[mid] < target:
            min = mid + 1
        elif L[mid] > target:
            max = mid - 1
        else:
            return mid
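Python's standard bisect module performs the same kind of search on a sorted list. As an illustration in Python 3, the sketch below finds the first candidate for j in the example described earlier, where the number at i is -30 and the maximum is 24. The helper name first_candidate and the sample list are made up for this example.

```python
import bisect


def first_candidate(numbers, i):
    """Index of the first j > i that can still reach a zero sum with the maximum."""
    # With the largest number as third element, j needs at least this value.
    target = -(numbers[i] + numbers[-1])
    # Search only between i + 1 and the second-to-last element.
    return bisect.bisect_left(numbers, target, i + 1, len(numbers) - 1)


numbers = sorted([-30, -29, -12, 3, 6, 7, 24])
j = first_candidate(numbers, 0)
```

Here j skips past -29, -12, and 3 and lands on 6, the first value that can produce a zero sum together with -30 and 24.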

To give you an idea of the performance of each algorithm, here is a test run with 800 numbers, showing the number of i+j+k==0 comparisons performed by each algorithm and the total time needed:

Algorithm	Comparisons	Time (s)
Brute force	510,081,600	21.098
Three loops	636,804	0.109
DP	195,048	0.071
DP + binary search	133,383	0.036

In short, when solving problems like 3Sum, avoid comparing all elements. This can be done by reducing the search space and ignoring candidates that won't lead to a solution anyway. Aside from reducing the search space, if we can order it in some form, we have even more optimization opportunities. In that case, we can typically avoid linear search and use binary search.

Check out many more algorithms with visualizations at PyAlgoViz.


Big Numbers

2016-12-27T12:40:00.002-08:00

The human brain is ill-equipped to deal with large numbers. Drawing attention to large numbers is only partially effective. For instance, the following ticker in NYC displays the current US National Debt:

The US National Debt clock at Union Square, NYC

That number is so big it becomes abstract and meaningless. In fact, the above picture is already from a while back. By the time you read this, the actual US National Debt is closer to $20 trillion. A number so large it becomes hard to parse, especially as the image above is missing commas.

Maybe it helps if we write down the current amount, as of writing of this text, in words and compare it with another large country:

Country	Debt in Dollars	Debt in Words/Speech
USA	$19,859,586,951,270	Nineteen trillion, eight hundred fifty-nine billion, five hundred eighty-six million, nine hundred fifty-one thousand, two hundred seventy dollars.
Russia	$152,205,694,374	One hundred fifty-two billion, two hundred five million, six hundred ninety-four thousand, three hundred seventy-four dollars.
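Writing out numbers in words is easy to automate. The following Python 3 sketch uses the short scale and supports numbers below one quadrillion; the function names are made up for this example, and the output format mimics the table above.

```python
ONES = ['', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight',
        'nine', 'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
        'sixteen', 'seventeen', 'eighteen', 'nineteen']
TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy',
        'eighty', 'ninety']
SCALES = ['', 'thousand', 'million', 'billion', 'trillion']


def group_to_words(n):
    # Spell out a number between 1 and 999.
    words = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        words.append(ONES[hundreds] + ' hundred')
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        words.append(TENS[tens] + ('-' + ONES[ones] if ones else ''))
    elif rest:
        words.append(ONES[rest])
    return ' '.join(words)


def number_to_words(n):
    if not 0 <= n < 10 ** 15:
        raise ValueError('only numbers from 0 up to 10^15 are supported')
    if n == 0:
        return 'Zero'
    groups = []
    scale = 0
    while n:
        # Peel off three digits at a time, starting at the right.
        n, group = divmod(n, 1000)
        if group:
            name = group_to_words(group)
            if SCALES[scale]:
                name += ' ' + SCALES[scale]
            groups.append(name)
        scale += 1
    text = ', '.join(reversed(groups))
    return text[0].upper() + text[1:]
```

Calling number_to_words(152205694374) spells out the Russian debt from the table, without the trailing "dollars".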

If we actually do round off the US debt to $20 trillion, the Russian debt is actually the rounding error. That's how large the US number really is.

Payback Time

To get a feeling for the amount: if you spent 100% of your current net income on paying off the US National Debt, how many years would it take? For the average US citizen, it would take 525 million working years. If we combine the entire US workforce of roughly 155 million people and spend all their income on relieving the US debt, it would still take an entire Trump presidency to pay off the debt. This is not realistic for a few reasons, of course.

Even if each person in the US workforce would spend a modest 10% of their net income on paying off the national debt, it would take the entire country 34 years to pay off the current debt amount, assuming the US government was able to not grow the debt even larger.
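The relation between these numbers can be checked with a few lines of Python 3. The sketch only uses the figures quoted above; the average net income is not given in this post, so it is derived here from the 525 million years and should be read as an implied estimate.

```python
debt = 19859586951270            # US National Debt in dollars, from the table
years_one_citizen = 525e6        # working years for one average citizen
workforce = 155e6                # approximate size of the US workforce

# Net income implied by the numbers above (an estimate, not a quoted figure).
implied_net_income = debt / years_one_citizen

# The whole workforce spending 100% of its net income.
years_workforce = years_one_citizen / workforce

# The whole workforce spending 10% of its net income.
years_workforce_10_percent = years_workforce / 0.10

print('Implied net income: $%.0f per year' % implied_net_income)
print('Workforce at 100%%: %.1f years' % years_workforce)
print('Workforce at 10%%: %.1f years' % years_workforce_10_percent)
```

The workforce numbers come out at about 3.4 years and about 34 years, in line with the text.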

Mind-boggling numbers.

Visualizing Large Amounts of Money

Visualizing the debt in the shape of $100 bills may help. Here is the US National Debt in 2013, compared to an American Football field and Miss Liberty:

US National Debt visualized in stacks of $100 bills (link)

For reference, this is what one billion dollars looks like in stacks of $100 bills:

It is rumored that Pablo Escobar lost $2.3 billion each year due to rats eating the bills. That's 2.3X the amount shown above.

Pictures help. Check out demonocracy.info for many other visualizations of large amounts of money.

Short Scale and Long Scale

The large numbers we talked about so far are huge. But, we cannot even agree on what to call really large numbers. Two different scales to talk about large numbers are in use:

Long scale. Every new term in the scale is one million times larger than the previous. A billion means a million millions (10^12), a trillion means a million billions (10^18), etc.
Short scale. Every new term is one thousand times larger than the previous. A billion means a thousand millions (10^9), a trillion means a thousand billions (10^12), etc.
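Both scales follow a simple rule for the n-th "-illion" term, with million as the first, billion as the second, and trillion as the third. A small Python 3 sketch of that rule, with function names chosen for this example:

```python
def short_scale_exponent(n):
    # Each new term is 1,000 times larger: 10^6, 10^9, 10^12, ...
    return 3 * n + 3


def long_scale_exponent(n):
    # Each new term is 1,000,000 times larger: 10^6, 10^12, 10^18, ...
    return 6 * n


for n, name in enumerate(['million', 'billion', 'trillion'], start=1):
    print('%-8s short scale: 10^%-2d long scale: 10^%d' % (
        name, short_scale_exponent(n), long_scale_exponent(n)))
```

Both scales agree on a million and drift apart from the second term onwards.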

For fun, here is what scale is being adopted by different countries across the world, including a couple of exceptions to the rule:

Use of short and long scale across the world (wikipedia)

If you care, a milliardaire and a billionaire are equally rich. Confusing.

Algorithms and Big Numbers

Large numbers confound us. Reasoning about large numbers is even harder. So how can we make the analysis of algorithms more insightful? Applying scales helps. Writing out large numbers in words may help. Visualization definitely helps.

Say we have a certain algorithm that we run on a modest number of 50 elements. The algorithm has implementations with different time complexity. How efficient is each algorithm? How well does it scale when we increase the number of elements? Those are questions we care about when scaling to thousands, millions, or even billions of users.

For each class of complexity, the required operations for 50 elements are at the following scale:

Complexity	Operations	Number of Operations in Words/Speech
O(1)	1	One
O(log n)	4	Four
O(n)	50	Fifty
O(n log n)	195	One hundred ninety-five
O(n^2)	2 500	Two thousand, five hundred
O(2^n)	1 125 899 906 842 624	One quadrillion, one hundred twenty-five trillion, eight hundred ninety-nine billion, nine hundred six million, eight hundred forty-two thousand, six hundred twenty-four
O(factorial)	30 414 093 201 713 378 043 612 608 166 064 768 844 377 641 568 960 512 000 000 000 000	Thirty vigintillion, four hundred fourteen novemdecillion, ninety-three octodecillion, two hundred one septendecillion, seven hundred thirteen sexdecillion, three hundred seventy-eight quindecillion, forty-three quattuordecillion, six hundred twelve tredecillion, six hundred eight duodecillion, one hundred sixty-six undecillion, sixty-four decillion, seven hundred sixty-eight nonillion, eight hundred forty-four octillion, three hundred seventy-seven septillion, six hundred forty-one sextillion, five hundred sixty-eight quintillion, nine hundred sixty quadrillion, five hundred twelve trillion
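The values in this table can be recomputed with a few lines of Python 3. The sketch assumes the natural logarithm for the two logarithmic rows, as that comes closest to the 4 and 195 shown; the table itself rounds those values to whole numbers.

```python
import math


def operations(n):
    # Number of operations per complexity class for n elements.
    return [
        ('O(1)', 1),
        ('O(log n)', math.log(n)),        # natural logarithm (assumption)
        ('O(n)', n),
        ('O(n log n)', n * math.log(n)),
        ('O(n^2)', n ** 2),
        ('O(2^n)', 2 ** n),
        ('O(factorial)', math.factorial(n)),
    ]


for name, count in operations(50):
    if isinstance(count, float):
        print('%-13s %.1f' % (name, count))
    else:
        print('%-13s %d' % (name, count))
```

Because Python integers have arbitrary precision, the values for O(2^n) and O(factorial) are computed exactly.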

Those numbers may still not mean much. But, just read out the value for O(2^n). It does sound silly, right? The number for O(factorial) actually sounds laughable. This is what humans do with large numbers. Our brain does not know what to do with them, gives up, and turns it into a joke. BTW. Who made up those names? Vigintillion? Novemdecillion? Undecillion?

Over their training and/or career, most software engineers have developed an intuitive feeling for complexity theory. They know that in interview questions, proposing an implementation that is O(n^2) means you fell for the trap set by the interviewer. There must be a way to use a hashmap somewhere to get the cost down to O(n) or even O(1).

But does the candidate really understand why O(n^2) is so bad? How do O(n) and O(n^2) relate to each other as the number of elements grows? Let's try a visualization (click the play icon to replay):

Interactive chart: number of operations per complexity class for 1 to 50 elements

In the interactive chart above, the number of elements ranges from 1 to 50 and is plotted from left to right. The vertical scale shows the number of operations, capped at 800. O(log n), O(n), and even O(n log n) stay low for a long time. The big P refers to problems that can be solved in polynomial time, while NP refers to nondeterministic polynomial time.

When interviewing, you definitely do not want to be nondeterministic. Already for n=10, the purple line for O(2^n) is off the scale in the above chart. Search for solutions that are O(n), as is the case in linear search, or O(log n), the number of steps needed in binary search, or ideally O(1), when using a hashtable.

Links

A large number of live tickers, including the current US National Debt can be found at usdebtclock.org.
Pronouncing numbers as words makes for a nice programming interview question, but you can also go to Calculator Soup.
Check out demonocracy.info for many other visualizations of large amounts of money.
For more details on program analysis, check out time complexity.
The O(x) visualization can be found amid numerous algorithm visualizations at PyAlgoViz.

Auger - Automatic Unit Test Generation for Python

2016-12-19T09:15:00.002-08:00

Unit tests are crucial to software development. They verify whether a given component actually implements what it promises. They are also important for long-living code, where future maintenance can be much more expensive without the proper level of unit test coverage. Not surprisingly, the lack of unit tests is a show-stopper for submitting production code at many companies.

You may wonder, if unit tests are so great, why do engineers hate to write them? More than once have I heard fellow engineers say "If you approve my code now, I will write the unit tests in a future change list." All engineers I know actually like writing code. It is creative. Code means impact. However, in general, engineers do not like writing unit tests at all. Why is that?

Why is Unit Testing Hard?

Unit testing is hard for various reasons:

One reason would be that a unit testing framework is exactly that. A framework. When I worked on the IBM J9 and Eclipse team with Dave Thomas, he used to say "Everybody likes to write frameworks, but nobody likes to actually use them". In practice, unit test frameworks add yet another complexity to learn and master, especially when unit test frameworks have little in common across programming languages or organizations.
All unit testing frameworks start off modestly. Erich Gamma once confided in me how he and Kent Beck wrote the original version of JUnit when they got bored on a transcontinental flight. He added how those few hours of work had been by far the best investment of his technical career ever. Today, however, unit testing frameworks are far from trivial and require a considerable learning investment in understanding their power and intricacies.
Software systems themselves are also becoming increasingly complex. Even seemingly standalone components are heavily embedded in a context of complex runtimes. To isolate those dependencies and sculpt them with the proper "mocks" is an art. It is not uncommon to spend hours trying to find out how to write the mocks for one line of test code. Dependency injection and fancy mock syntax conspire to make the life of unit test authors challenging.
Code under development constantly changes. Often, coding can be a discovery game. A design doc can outline architectural decisions on what database to use and how the UI is created. However, it tends to leave the actual coding as an exercise to the reader. During coding, discoveries are made. Code is constantly refactored. Code is found obsolete and is deleted even before it is ever committed to version control. This is particularly the case when a new domain or technology is being explored. We still write unit tests, but we write them later, when the dust settles.

OK. Unit tests are hard. But, what is the alternative?

Let's agree, unit tests are a chore and tough to write. This does not mean engineers are not testing their code, even if they do not write unit tests. Many engineers, like myself, write code in a highly iterative fashion. For instance, say I am writing an AppEngine app to return a web page. I would first add a route, write a Handler, and simply return "Hello World". To test, I would spin up a local instance, point my browser at localhost:8080 and see if it shows the string I expected. I iterate that step hundreds of times.

In this iterative development mode, small, incremental steps are made towards the end goal and progress is constantly validated. Each time, some more functionality is added and tested on the code in progress. At some point, we are going to be "done". At this final point in time, hundreds or thousands of "test" runs have been exercised on the code under development. Each component has had some inputs and produced some expected outputs. If only we could remember what those inputs and outputs were, our tests would be so much easier to write. Cue: Project Auger.

What is Project Auger?

Project Auger (Automated Unittest Generator) watches your Python code while you write it and automatically generates all unit tests for your code, including all the mocks. Little or no work is required by the developer.

How does Auger Work?

Auger works like a smart Python debugger that sets breakpoints for each component you are interested in. Auger tracks two kinds of function calls related to the module under test:

For each function defined in the module, Auger records both the values of the arguments and the returned results. After recording enough execution traces, unit tests can be generated with meaningful argument values and assertions.
For each call made from a given component to dependent libraries or other components, we record the return value, so that this call can be automatically mocked out with known return values.

Auger tracks all possible functions, including instance methods, class methods, and static functions.
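The recording idea can be illustrated with Python's sys.settrace hook, which is also what debuggers build on. The following Python 3 sketch is not Auger's actual implementation; record_calls and the two sample functions are hypothetical names that only show how arguments and return values can be captured while code runs.

```python
import sys


def record_calls(func, *args, **kwargs):
    """Run func and record (name, arguments, return value) for each Python call."""
    records = []
    pending = {}  # maps a running frame to the arguments it was called with

    def tracer(frame, event, arg):
        if event == 'call':
            code = frame.f_code
            names = code.co_varnames[:code.co_argcount]
            # Snapshot the positional arguments at the moment of the call.
            pending[frame] = {name: frame.f_locals[name] for name in names}
        elif event == 'return':
            # For a 'return' event, arg holds the value being returned.
            records.append((frame.f_code.co_name, pending.pop(frame, {}), arg))
        # Returning the tracer keeps it active for this frame.
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args, **kwargs)
    finally:
        sys.settrace(previous)
    return records


def add(a, b):
    return a + b


def double_sum(a, b):
    return 2 * add(a, b)


records = record_calls(double_sum, 3, 4)
```

Each record holds enough information to generate an assertion such as assertEqual(add(a=3, b=4), 7), or to mock the call with its recorded return value.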

Consider the following example, pet.py, that provides a Pet with a name, age, and a species:

from sample.animal import Animal


class Pet(Animal):
    def __init__(self, name, *args):
        Animal.__init__(self, *args)
        self._name = name

    def get_name(self):
        return self._name

    @staticmethod
    def lower(s):
        return s.lower()

    def __str__(self):
        return '%s is a %s aged %d' % (
            self.get_name(),
            Pet.lower(self.get_species()), self.get_age()
        )


def create_pet(name, species, age=0):
    return Pet(name, species, age)


if __name__ == '__main__':
    print(Pet('Polly', 'Parrot'))
    print(create_pet('Clifford', 'Dog', 32))

This class has a few different entry points we would need to unit test:

The class Pet itself which has:

a static method, lower
two instance methods, get_name and __str__
a constructor, __init__, which is really a very special instance method

A module-level function, create_pet, that creates a Pet and returns it

The Pet class is a subclass of Animal, which we know nothing about, so we will need to mock that entire class. We do know that the class is used in the Pet constructor and that inherited methods are called from Pet as well. This means that the implementations of self.get_species() and self.get_age() are unknown, as we cannot look at the implementation of Animal when unit testing Pet. Therefore, those two inherited methods will be mocked out.

Unit Tests Generation with Auger

The above class definition combined with an execution run is enough for Auger to automatically create the following fully functional unit test:

from mock import patch
from sample.animal import Animal
import sample.pet
from sample.pet import Pet
import unittest


class PetTest(unittest.TestCase):
    @patch.object(Animal, 'get_species')
    @patch.object(Animal, 'get_age')
    def test___str__(self, mock_get_age, mock_get_species):
        mock_get_age.return_value = 12
        mock_get_species.return_value = 'Dog'
        pet_instance = Pet('Clifford', 'Dog', 12)
        self.assertEquals(pet_instance.__str__(), 'Clifford is a dog aged 12')

    def test_create_pet(self):
        self.assertIsInstance(sample.pet.create_pet(age=12,species='Dog',name='Clifford'), Pet)

    def test_get_name(self):
        pet_instance = Pet('Clifford', 'Dog', 12)
        self.assertEquals(pet_instance.get_name(), 'Clifford')

    def test_lower(self):
        self.assertEquals(Pet.lower(s='Dog'), 'dog')

if __name__ == "__main__":
    unittest.main()

No changes to the original code are needed to teach Auger anything. All that is required is for the developer to write their original code and exercise it somehow. In the above case, we simply ran python sample.pet to produce two scenarios in which Pet instances were created and manipulated. From those two samples, a single test was extracted.

Of course, Auger is limited in the sense that it cannot guess what scenario is being tested. Rather than generating multiple, focused tests per module, it will generate one big test that covers the entire module. The value of Auger is more in generating all the boilerplate code, imports, and mocks, ensuring proper coverage, and generating a template for manual refinement.

To generate a set of unit tests, Auger magic is invoked:

import auger

... your code goes here ...

if __name__ == "__main__":
    with auger.magic([pet]):
        ... call the main routine for your code ...

In this case, one module is passed, pet, but multiple modules can be passed as well. Each one will be traced and unit tested.

When a unit test is produced, it is written out to the local file system under the corresponding tests folder.

IDE Integration

Auger does not have direct IDE integration per se, but works really well with PyCharm. This integration comes for free, because the IDE watches the underlying file system and will automatically discover when new files are created by Auger in the local file system. These tests can then be executed easily as well:

Adding the generated tests to Git and committing or pushing them to a repository takes just a few clicks from that point onwards in an IDE such as PyCharm.

Future Work

Incremental test generation. Collect multiple execution runs, persist the invocations, and merge multiple runs into one test case.
Preserve manual edits performed by users on generated test cases when a test is regenerated.
Support other unit test frameworks, such as pytest, nose, cucumber, etc.
Figure out how to run Auger on itself. This is non-trivial :-)

Check out Project Auger and let me know what you think of it. Pull requests are welcome.

QuickSort

2016-12-18T10:36:00.003-08:00

QuickSort is a divide-and-conquer algorithm that partitions an array around a pivot and recursively sorts each part. In other words, to sort an array, we pick a pivot, split the array into three sections, holding the elements smaller than, equal to, or larger than the pivot, and recursively sort the smaller and larger sections:

def qsort(array):
    if len(array) < 2: return array
    pivot = array[0]
    less = [n for n in array if n < pivot]
    equal = [n for n in array if n == pivot]
    larger = [n for n in array if n > pivot]
    return qsort(less) + equal + qsort(larger)

In the original implementation of QuickSort, the first element in the array was picked as pivot, as is done on line 3 above. Below, the red bar indicates the choice of pivot. At each step of the algorithm, it is at the left of the array being sorted:

Choosing the first element as pivot can generate extra work when the array is already (partially) sorted. Rather than picking the first element, the middle element can be chosen, behaving better on already sorted arrays:

def qsort(array):
    if len(array) < 2: return array
    # integer division, so the index stays an int on Python 3 as well
    pivot = array[len(array)//2]
    less = [n for n in array if n < pivot]
    equal = [n for n in array if n == pivot]
    larger = [n for n in array if n > pivot]
    return qsort(less) + equal + qsort(larger)

Choosing the value in the middle of the array as pivot yields the following result, with the red bar again indicating the pivot:

Rather than pick the middle element as pivot, a random value can be picked from the array:

import random

def qsort(array):
    if len(array) < 2: return array
    # randint includes both end points, hence len(array)-1
    pivot = array[random.randint(0, len(array)-1)]
    less = [n for n in array if n < pivot]
    equal = [n for n in array if n == pivot]
    larger = [n for n in array if n > pivot]
    return qsort(less) + equal + qsort(larger)

The pivot is now picked randomly as shown in the visualization below:

The final approach to pivot selection we will look at was popularized by Robert Sedgewick and involves picking the median of three values, here the first, middle, and last elements:

def qsort(array):
    if len(array) < 2: return array
    # median (not the average) of the first, middle, and last elements
    pivot = sorted([array[0], array[len(array)//2], array[-1]])[1]
    less = [n for n in array if n < pivot]
    equal = [n for n in array if n == pivot]
    larger = [n for n in array if n > pivot]
    return qsort(less) + equal + qsort(larger)

The pivot is now picked by looking at three different values as shown in the visualization below:

Aside from how we pick the pivot, all versions of QuickSort shown above have in common that they use recursion to implement the divide-and-conquer tactic. Recursion has overhead in that a new stack frame needs to be constructed for each call. Recursion also limits the maximum size of the array we can sort, but that is more of a theoretical challenge.
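
To get a feel for how deep the recursion goes, the following sketch instruments the recursive version with a depth counter. The helper names qsort_depth, first, and middle are made up for this illustration and are not part of the original code. On already sorted input, the first-element pivot recurses once per element, which is where Python's recursion limit could start to matter, while the middle pivot stays shallow:

```python
def qsort_depth(array, pick_pivot, depth=1):
    # Same partitioning as the recursive versions above, but also
    # returns the deepest recursion level that was reached.
    if len(array) < 2:
        return array, depth
    pivot = pick_pivot(array)
    less = [n for n in array if n < pivot]
    equal = [n for n in array if n == pivot]
    larger = [n for n in array if n > pivot]
    sorted_less, depth_less = qsort_depth(less, pick_pivot, depth + 1)
    sorted_larger, depth_larger = qsort_depth(larger, pick_pivot, depth + 1)
    return sorted_less + equal + sorted_larger, max(depth_less, depth_larger)

def first(array):
    return array[0]

def middle(array):
    return array[len(array) // 2]

data = list(range(200))  # already sorted input
_, depth_first = qsort_depth(data, first)
_, depth_middle = qsort_depth(data, middle)
print(depth_first, depth_middle)
```

With 200 sorted numbers, the first-element pivot reaches a depth equal to the array length, while the middle pivot needs only a handful of levels.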

QuickSort can also be implemented without recursion, by using a user-defined stack to keep track of the work to be done. A possible implementation:

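def qsort(array):
    work = [array]    # sub-arrays still to be sorted, leftmost first
    result = []
    while work:
        array = work.pop(0)
        if len(array) < 2:
            result.extend(array)
        else:
            pivot = array[0]
            less = [n for n in array if n < pivot]
            equal = [n for n in array if n == pivot]
            larger = [n for n in array if n > pivot]
            if not less and not larger:
                # only duplicates of the pivot are left; queueing them
                # again would never terminate, so emit them directly
                result.extend(equal)
            else:
                work = [less, equal, larger] + work
    return result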

And the execution (this time using the first element as pivot again):

In general, QuickSort is not a stable sort. For most implementations, elements of equal value are not guaranteed to remain in their original relative order. However, the implementations above keep the elements that are equal to the pivot in the order in which they were found. That makes the above implementations stable.
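
Stability only becomes visible when elements that compare equal can still be told apart. The sketch below is a variation on the recursive version that sorts records by a key; the function name qsort_by_key and the sample records are assumptions made for this illustration:

```python
def qsort_by_key(records, key):
    # Same three-way partition as above, but comparing key(record)
    # instead of the record itself.
    if len(records) < 2:
        return records
    pivot = key(records[0])
    less = [r for r in records if key(r) < pivot]
    equal = [r for r in records if key(r) == pivot]
    larger = [r for r in records if key(r) > pivot]
    return qsort_by_key(less, key) + equal + qsort_by_key(larger, key)

pets = [("Clifford", 12), ("Rex", 3), ("Tom", 12), ("Felix", 3)]
by_age = qsort_by_key(pets, key=lambda pet: pet[1])
print(by_age)
```

Because each list comprehension walks the input from left to right, records with the same age come out in the order in which they went in.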

Check out more algorithm visualizations at PyAlgoViz.

MergeSort

2016-12-16T14:27:00.000-08:00

MergeSort is a divide-and-conquer algorithm that recursively sorts two halves of an array and merges them to form a fully sorted end result. In other words, to sort a sequence, we find the middle point, sort the first half, sort the second half, and finally merge both halves:

def mergeSort(array, start, end):
    if end - start > 1:
        middle = (start + end) // 2
        mergeSort(array, start, middle)
        mergeSort(array, middle, end)
        merge(array, start, middle, middle, end)

The merging of the two halves is done by inserting elements of the second half into the first half. The second half keeps shrinking in size until we are done:

def merge(array, left, leftEnd, right, rightEnd):
    while left<leftEnd and right<rightEnd:
        if array[left] > array[right]:
            array.insert(left, array.pop(right))
            right += 1
            leftEnd += 1
        else:
            left += 1

Executing the above implementation on 40 random numbers produces the following result:

This implementation is a stable sort. Elements of equal value remain in their original sort order.
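
To try the two functions outside of the visualizer, a small driver along the following lines could be used. This is only a sketch: both functions are repeated so the snippet runs on its own, integer division (//) is used so that it also works on Python 3, and the driver code at the bottom is an assumption, not part of the original post:

```python
import random

def merge(array, left, leftEnd, right, rightEnd):
    # Move elements of the right half into place in the left half.
    while left < leftEnd and right < rightEnd:
        if array[left] > array[right]:
            array.insert(left, array.pop(right))
            right += 1
            leftEnd += 1
        else:
            left += 1

def mergeSort(array, start, end):
    # Sorts array[start:end] in place; end is exclusive.
    if end - start > 1:
        middle = (start + end) // 2
        mergeSort(array, start, middle)
        mergeSort(array, middle, end)
        merge(array, start, middle, middle, end)

numbers = [random.randint(0, 99) for _ in range(40)]
expected = sorted(numbers)
mergeSort(numbers, 0, len(numbers))
print(numbers == expected)
```

Note that the sort happens in place, so the call takes the full range of the list and returns nothing.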

Check out more algorithm visualizations at PyAlgoViz.

Generating Prime Numbers

2016-12-15T15:08:00.000-08:00

In this blog post, we will generate prime numbers and explore brute force, memoization, and dynamic programming.

Prime numbers are numbers greater than one that are only divisible by themselves and one. In other words, the following Python function returns True if n is greater than one and none of the numbers from 2 up to, but not including, n is a factor of n:
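
A minimal sketch of such a function, assuming the name isPrime that is used later in this post, could look like this:

```python
def isPrime(n):
    # Brute force: try every candidate factor from 2 up to n - 1.
    if n < 2:
        return False
    for factor in range(2, n):
        if n % factor == 0:
            return False
    return True

print([n for n in range(20) if isPrime(n)])
```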

Of course, this implementation is rather naive and takes a lot longer than is necessary:

Our first optimization would be to realize that we need not check all factors in between 2 and n. Namely, we only need to look at the factors between 2 and n/2, as any number higher than that could never be a factor of n.

Our code now looks like this, with the changed loop bound marked as doing half the work:
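
As a sketch of that change, again assuming a function named isPrime, only the upper bound of the loop differs from the brute-force approach:

```python
def isPrime(n):
    if n < 2:
        return False
    # No factor of n other than n itself can be larger than n / 2.
    for factor in range(2, n // 2 + 1):  # <== doing half the work
        if n % factor == 0:
            return False
    return True

print([n for n in range(20) if isPrime(n)])
```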

With this optimization in place, the algorithm runs about twice as fast already:

Still, we can do better. We can actually already stop considering factors when we reach sqrt(n):
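
The reasoning is that factors come in pairs: if n = a * b, then the smaller of the two is at most sqrt(n), so a factor above sqrt(n) always has a partner below it. A sketch of this version, assuming the same isPrime name, might be:

```python
import math

def isPrime(n):
    if n < 2:
        return False
    # Stop at sqrt(n): any larger factor pairs with a smaller one
    # that would already have been found.
    for factor in range(2, int(math.sqrt(n)) + 1):
        if n % factor == 0:
            return False
    return True

print([n for n in range(50) if isPrime(n)])
```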

A further optimization would be to realize that we only need to consider the factors 2, 3, 5, 7, 9, etc., because if an even number such as 4, 6, or 8 divided n, then 2 would divide n as well. But we'll leave that refinement to the reader.

Let's try our sqrt(n) optimization in the next iteration of our algorithm:

By now, our algorithm should be more than twice as fast.

Of course, we can still do better. For instance, say we are checking to see if 41 is a prime number. To validate 41, our algorithm will try the factors between 2 and sqrt(41), which rounds down to 6:

41 % 2 = 1, not a factor
41 % 3 = 2, not a factor
41 % 4 = 1, not a factor
41 % 5 = 1, not a factor
41 % 6 = 5, not a factor

When we check for factors 4 and 6, we are doing double work. Namely, if 41 were divisible by 4, it would already be divisible by 2. The same argument goes for 6, which is itself divisible by 2 and 3.

In other words, to check if a given number is prime, we only need to try the prime numbers between 2 and sqrt(n) as factors. This greatly reduces the search space. All we need to do is remember the current list of prime numbers and use those as factors. This borrows the key idea behind a prime sieve, namely reusing the primes already found, although strictly speaking it is still trial division, now restricted to prime factors:
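
One way this could be written is sketched below. It assumes a module-level list named primes as the cache and keeps that list gap-free by discovering every prime up to n before answering, which is one possible design and not necessarily the original one:

```python
import math

primes = []  # all primes found so far, in increasing order, without gaps

def isPrime(n):
    if n in primes:  # linear search, because primes is a list
        return True
    # Extend the cache up to n, testing each new candidate only against
    # cached primes that are no larger than its square root.
    candidate = primes[-1] + 1 if primes else 2
    while candidate <= n:
        limit = int(math.sqrt(candidate))
        for p in primes:
            if p > limit:
                primes.append(candidate)  # no prime factor found
                break
            if candidate % p == 0:
                break  # composite
        else:
            primes.append(candidate)  # cache was empty, so this is 2
        candidate += 1
    # The cache now covers everything up to n, so n is prime
    # exactly when it was the last prime added.
    return bool(primes) and primes[-1] == n

print([n for n in range(50) if isPrime(n)])
```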

Note that the first line of our isPrime function now has a linear search because we used a list to hold the current primes. Ideally we would have used an OrderedSet instead of a list, to make membership checks O(1) and speed up that part of the algorithm even more. For simplicity, we use a list now.

Using our newly minted insight on using primes as factors, we produce the following execution:

That is confusing. This is slower. How can that be?

One reason is that at this point, our code is a lot more complex than before. Namely, in order to answer the question whether a given number is a prime, in the worst case, we need to discover all the primes before that number. There is also some bookkeeping overhead for caching each of the primes we found so far. We don't get much apparent speedup due to all the extra work we need to do.

However, once we have warmed up the cache by running our loop once ahead of time, and then measure a second iteration of our loop, we see an interesting speedup again:

Essentially, we are demonstrating the power of memoization here. The progression we have seen so far is from brute force, to a bit smarter, to using another smart insight, to remembering intermediate steps. By remembering the intermediate results, we essentially are performing a space-time tradeoff. By adding a bit more memory, we can avoid performing repetitive operations.
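
The cold versus warm effect can be illustrated without any timing by counting how often the function body actually runs. The sketch below swaps in the standard library's functools.lru_cache decorator as the memoization mechanism, which is an assumption and differs from the hand-rolled prime cache described above; the stats dictionary is likewise just a counter for this illustration:

```python
import functools
import math

stats = {"calls": 0}  # how often the function body really executes

@functools.lru_cache(maxsize=None)
def isPrime(n):
    stats["calls"] += 1
    if n < 2:
        return False
    return all(n % factor != 0 for factor in range(2, int(math.sqrt(n)) + 1))

cold = [isPrime(n) for n in range(1000)]   # first pass fills the cache
cold_calls = stats["calls"]
warm = [isPrime(n) for n in range(1000)]   # second pass is served from it
warm_calls = stats["calls"] - cold_calls
print(cold_calls, warm_calls)
```

The second pass returns the same answers without executing the body at all, at the cost of keeping one cached result per number in memory.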

The next big leap in an algorithm such as finding prime numbers depends on whether we can make any assumptions about the order in which the API is accessed. For instance, are we more interested in being able to figure out if any given random number is a prime number, or are we more interested in producing a prime number generator, one that produces increasingly larger primes? If the latter is the case, we enter the domain of Dynamic Programming.

By assuming we produce increasingly larger primes, the Pythonic instrument we love to go for is a generator function. It can keep local state, return one result at a time, and resume execution at the place where it yielded the most recent result:

The use of co-routines in this manner allows us to combine the concept of memoization and cleanly wrap it in a nice abstraction. The function can be used like any regular Python function as follows:
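
A minimal sketch of such a generator and its use follows. The name primeGenerator and the local list found are assumptions made for this illustration:

```python
import itertools

def primeGenerator():
    found = []      # local state that survives between yields
    candidate = 2
    while True:
        # Only cached primes up to sqrt(candidate) need to be tried.
        small = itertools.takewhile(lambda p: p * p <= candidate, found)
        if all(candidate % p != 0 for p in small):
            found.append(candidate)
            yield candidate  # execution resumes here on the next request
        candidate += 1

# Used like any other iterable: loop until we have seen enough.
small_primes = []
for p in primeGenerator():
    if p > 30:
        break
    small_primes.append(p)

# Or take a fixed number of primes from the infinite stream.
first_ten = list(itertools.islice(primeGenerator(), 10))
print(small_primes, first_ten)
```

Each call to primeGenerator() starts with a fresh cache, so the memoized primes live exactly as long as the generator object that produced them.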

The final iteration of our algorithm now produces:

This demonstrates a nice progression from brute force to memoization to DP.

That said, if you made it this far, you probably want to take a look at the Sieve of Atkin, which generates primes using an O(n) approach.

For more algorithms and their visualizations, check out PyAlgoViz.

What I Learned While Fighting Fake News

2016-12-13T18:37:00.001-08:00

In a couple of upcoming blog entries I will share what I learned while building Realnews/Fakenews, a Chrome extension that adds visual indicators to sites such as Facebook and Twitter to warn readers of possible Fake News links:

npr.org: Fake News

I will talk about the extension itself, the architecture and implementation of the server, choices in hosting plans and pricing for the server, and how to use machine learning to predict fake news from just looking at the URL.

Fake News is serious. It can lead to vigilantes storming pizza parlors and foreign parties influencing US elections or upcoming UK elections. It is hard to root out fake news as it feeds on a powerful psychological trait called confirmation bias. What that means is that fake news is often designed to cater to a specific target audience. Fake news authors write specific content their targets like to hear, whether it is factual, hyperbolic, or not. The goal is to increase the likelihood of them sharing it, thereby strengthening the original cause or, more likely, generating advertisement income.

What can we do about fake news? In a simple diagram, this is how fake news spreads:

There is not much we can do to stop fake news publishers from actually creating content. What the industry can do, social media outlets in particular, is to stop giving fake news publishers a free distribution channel. That's a good step in the right direction.

Where the Realnews/Fakenews Chrome extension helps is at the reading stage. When a link is presented on Facebook or Twitter to a third party site as a shared post, a small button is added to the top right to indicate the category of the site:

In this case, the site this tweet links to is marked as real news. Of course, the subject itself is fake news, but that's not the point. When the green button is pressed, a category selector shows up as follows:

When a different category is chosen by the reader, it counts as one vote. A consensus model is used on the server so that future readers will be presented with the popular vote. Voting is non-moderated.
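
The server implementation is not shown in this post, but the popular-vote idea can be sketched in a few lines. Everything below, including the names votes, vote, and consensus and the in-memory storage, is a hypothetical illustration rather than the actual server code:

```python
from collections import Counter, defaultdict

votes = defaultdict(Counter)  # site -> tally of category votes (hypothetical store)

def vote(site, category):
    # Each reader choice counts as exactly one vote.
    votes[site][category] += 1

def consensus(site, default="unknown"):
    # Future readers are shown the category with the most votes.
    tally = votes.get(site)
    if not tally:
        return default
    return tally.most_common(1)[0][0]

vote("example.com", "fake news")
vote("example.com", "real news")
vote("example.com", "fake news")
print(consensus("example.com"), consensus("example.org"))
```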

Once enough pages have been voted on, features can be extracted such as the hosting domain of the link, words used in the URL, or the occurrence of specific content, such as lots of ads. Machine learning is then used to classify unknown sites and seed the prediction with a meaningful starting point.
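
As a rough sketch of the first two kinds of features, the hosting domain and the words in a URL can be pulled apart with the standard library. The function name url_features and the shape of the returned dictionary are assumptions for illustration; the actual feature set and classifier are not described here:

```python
import re
from urllib.parse import urlparse

def url_features(url):
    # Split a link into its hosting domain and the words in its path.
    parsed = urlparse(url)
    words = [w for w in re.split(r"[^a-z0-9]+", parsed.path.lower()) if w]
    return {"domain": parsed.netloc.lower(), "words": words}

print(url_features("https://www.example.com/politics/Shocking-Truth-revealed.html"))
```

Features like these could then be fed to a text classifier together with the votes as labels.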

Like I said at the start, more details on the implementation will come in future blog entries. For now, try it out!

PyAlgoViz on Github

2016-12-05T13:54:00.002-08:00

Long overdue, I finally published my PyAlgoViz source code on GitHub. The Python source shows you how to create a sandbox on AppEngine to run Python code, step-wise debug the code, run visualization logic at each step, record the visualization, and return the result back to the browser to visualize it there using HTML5 Canvas.

You may also like the accompanying video and slide deck linked from chrislaffra.com.

Happy visualizing!