tag:blogger.com,1999:blog-71528553451551014312024-02-20T08:16:13.490-08:00The Flying DutchmanChris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.comBlogger9125tag:blogger.com,1999:blog-7152855345155101431.post-76482131806827810642017-01-19T18:32:00.001-08:002017-01-19T18:32:30.957-08:00Simple serverless services using Google AppEngine<p>Distributed programming is hard. Data needs to be prepared, split up, executed in parallel, and combined into one solution again. Coordination, cluster management, fault-tolerance, and optimization makes parallel processing complex. Things are simplified when the problem is <a href="https://en.m.wikipedia.org/wiki/Embarrassingly_parallel">embarrassingly parallel</a>. This is the case when sections of the data can be processed without any dependency on other parts. The ultimate is a function that has no dependencies on its execution context and is side-effect free. Such a function can be easily memoized, or executed in any order, depending on a specific execution plan.<br><br>The aim of distributed programming is to increase scalability when a given set of operations cannot be scaled anymore on one server just by adding more memory or CPUs. The solution is to split up the work and send it to multiple machines to process. This allows us to scale to the processing of terabytes or petabytes of data using a cluster of thousands of nodes. That brings us to the <a href="https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html" target="_blank">latency scale for computing</a> inspired by the earlier list originally collected by <a href="http://norvig.com/21-days.html#answers" target="_blank">Peter Norvik</a>:<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgh9M4N_4X9q1oMmHiyUBFrLrhljT-vqqhSAe-t_IQ0u7LNHvJ5-S7vEBm7UN4ep-MANinvvGSpd9HNTxNuwggyKulraQUDyhSucDJo3hMQVKxklAW4WXDmsb6e8Ofo6t9Ou8LzJydsIUw/s1600-h/image33.png"><br><br><img title="image" style="float: none; margin-left: auto; display: block; margin-right: auto" alt="image" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8TrNr90Rkj1TauwH_0cH3eIf9pdswbd0JaS7jpANhoaQNuD3O7JvCox_pHPpSXFrS9ullvRthHoYZCBBsd02QsZzAx1TYuipmcENLQ8ovA7RQRehQl4BrR6UrxZynKQR-LHgTMhRMCEM/?imgmax=800" width="492" height="176"></a><br>The order of magnitudes for latency shown in the table above are to be considered when distributing data over a cluster. We need to balance the work to distribute to a node where the cost of transferring the data is less than computing it locally. </p> <h2>Serverless Functions</h2> <p><a href="https://martinfowler.com/articles/serverless.html" target="_blank">Serverless functions</a> allow developers to describe logic that executes in response to a given event and executes in a stateless container that can run one or more functions at the same time. Serverless frameworks do not need to be executed in parallel, but they are easily paralellizable due to their context-free and side-effect free behavior. <br><br>What do serverless functions look like and how can they be implemented? Let’s look an example. Say, we have a Google AppEngine application written in Python with the following lambda-like function that takes a number of zipcodes and looks up their address:</p><pre><code>def geolocate(zipcodes):
return [get_address(zipcode) for zipcode in zipcodes]
</code></pre>
<p>Now assume that the get_address function itself is implemented as a service call to an external geolocation service, such as the Google Maps API. We now run into the bottleneck of network speed. Depending on the distance between the client and the server, each call realistically costs ~10ms if server and client are in the same region and ~75ms between NYC and Seattle. Crossing the Atlantic Ocean takes an additional 100ms.
<p>If we have to resolve 1,000 addresses, the above simple code takes 10s at best and 3 minutes at worst. Typically, APIs are aware of this issue and allow for bundling of requests into one. If such bundling is not available, the solution would be to run multiple requests in parallel, either by using multiple threads or separate processes.
<p>The above example places the bottleneck in the network. However, for different calculations, the bottleneck might be CPU-heavy and require load balancing to a number of servers in a cluster. Managing such a cluster requires extra planning. It would be nice if we could not worry about how the calls are made in parallel, and simply use a decorator as follows:<pre><code>@serverless.parallel
def geolocate(zipcodes):
# run in parallel on a cluster
return [get_address(zipcode) for zipcode in zipcodes]
</code></pre>
<p>What the decorator does in this case is to split up the data into smaller chunks, send each chunk to a different server, and collate the results as they arrive in parallel, and finally return the result. For the reader, this code looks like it runs serially in one thread, but in reality it runs highly parallelized.
<p>Before we get to how this is implemented, here is a typical run of the hosted application at <a href="http://simple-serverless.appspot.com/">simple-serverless.appspot.com</a>:</p>
<p><code>Number of zipcodes = 3000<br>Regular duration = 345.57s<br>Parallel duration = 18.98s<br>Speedup = 18.2X<br>Details per pipeline step:<br> 1 - Step "geolocate" took 18.65s for 100 workers and 3000 elements<br> 2 - Step "cleanup" took 0.34s for 100 workers and 3000 elements<br> 3 - Step "sort" took 0.00s for 1 worker and 413 elements<br></p></code>
<p>In the above run, we resolved 3,000 zipcodes in ~19 seconds, while normal execution would take almost 6 minutes. Actual speedup depends on how many nodes are currently warmed up. Typically, a second run runs faster.
<p>
<h2>Implementation</h2>
<p>To implement the above decorator, the easy part is splitting the data into smaller chunks and collating the results after all the work is done. The hard part is finding servers to run the stateless function on and use an effective load balancer to horizontally scale to an elastic demand. It turns out that Google AppEngine was designed to do exactly that.
<p>If we can somehow handle a chunk by making a web service call back to our own domain, using GAE load balancing to dispatch back to the current server, or wake a new one depending on current load, we effectively piggyback on GAE to create a poor-man's serverless lambda implementation. It turns out that is not that hard to do.
<p>Breaking up the original data into multiple chunks looks like this:<pre><code>bucketSize = max(1, int(len(data) / WORKER_COUNT))
buckets = [
data[n: n + bucketSize]
for n in xrange(0, len(data), bucketSize)
]
</code></pre>
<p>We then convert the chunk into JSON, encode what lambda function we want to run, and invoke it as an RPC call:<pre><code># create a non-blocking, asynchronous worker to handle one <br></code><code># bucket, on our own instance, or on new ones automatically<br># launched by appengine
def createWorker(bucket):
worker = urlfetch.create_rpc(deadline=60)
scheme = os.environ['wsgi.url_scheme']
host = os.environ['HTTP_HOST']
url = scheme + '://' + host + URL
headers = {
'Content-Type': 'application/x-www-form-urlencoded',
SECRET_KEY: SECRET,
}
payload = urllib.urlencode({
'module': method.__module__,
'className': className,
'isclass': isclass,
'method': method.__name__,
'data': json.dumps(bucket),
'args': json.dumps(args),
'argv': json.dumps(argv),
})
return urlfetch.make_fetch_call(worker, url, payload, <br> 'POST', headers)
</code></pre>
<p>Note: We use a secret that is known only to the application, to avoid external calls to our worker functions. This security by obscurity is not recommended for production applications and a stronger authentication model should be used then. <br><br>The receiving worker route is set up in the client code as follows:<pre><code>app = webapp2.WSGIApplication([
('/', ZipCodeHandler),
serverless.init('/serverless_route', SERVERLESS_SECRET)
</code></pre>
<p>We initialize serverless with a route name of our choice, which can be any arbitrary name and the secret of our choosing. When a chunk of data is sent to that route as a POST, it is unpacked, the requested lambda function is invoked, and the result is returned in JSON format.
<p>The results for each of the workers is collated as follows:<pre><code>workers = </code><code>map(createWorker, buckets)
</code><code>result = list(itertools.chain(*map(getResult, workers))))<br></code></pre>
<p>The workers are created and will invoke their lambda functions asynchronously using the <em>createWorker</em> function shown above. The result is collected in the <em>getResult</em> handler, which blocks until the result is received:<pre><code>def getResult(worker):
response = worker.get_result().content
try:
return json.loads(response)
except Exception as e:
logging.error('Error: %s' % e)
return [e]
</code></pre>
<p>An additional utility is offered to handle workflow processing in the form of a pipeline:<pre><code>pipeline = serverless.Pipeline(geolocate, cleanup, sort)
</code></pre>
<p>This sets up a serverless pipeline where data streams from <em>geolocate</em>, to <em>cleanup</em>, to <em>sort</em>. The pipeline is invoked using its <em>run</em> method:<pre><code>zipcodes = [<br> </code><code>random.randint(10000,99999)<br> for n in xrange(ZIPCODE_COUNT)<br></code><code>]
addresses = pipeline.run(zipcodes)
</code></pre>
<p>The <em>cleanup</em> and <em>sort</em> functions look similar to the <em>geolocate</em> function. Key is that they are fully "functional". They depend only on the values of the data they process and are completely context free. The <em>cleanup</em> function is executed in parallel:<pre><code>@serverless.parallel
def cleanup(addresses):
return filter(None, addresses)
</code></pre>
<p>The <em>sort</em> function is executed sequentially, in the current thread, on all the collated results from the previous step in the pipeline:<pre><code>@serverless.sequential
def sort(addresses):
return sorted(addresses)
</code></pre>
<p>The entire implementation of this simple serverless library is just 230 lines of Python, slightly more than the length of this README file.
<p>Check the <a href="https://github.com/laffra/simple-serverless" target="_blank">github repo</a> with the source for simple-serverless.
<p> <p><em>Disclaimer</em>: This work is a personal weekend project and is unrelated to <a href="https://cloud.google.com/functions">Google Cloud Functions</a>, a serverless functions solution based on Node.js and <a href="https://cloud.google.com/dataflow/docs/quickstarts/quickstart-python">Google Dataflow for Python</a>, a more comprehensive solution for data-driven distribubted programming pipelines.</p>Chris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.com0tag:blogger.com,1999:blog-7152855345155101431.post-73635662385240469992017-01-06T07:37:00.002-08:002017-01-08T18:14:27.550-08:00The 3Sum Problem<div>
For a given list of numbers, the <a href="https://en.wikipedia.org/wiki/3SUM" target="_blank">3Sum problem</a> discovers whether there is at least one combination of three numbers in the list that sum up to zero. The same problem is also listed at <a href="https://leetcode.com/problems/3sum/" target="_blank">LeetCode</a>, with a small twist, where all solutions are to be returned, with duplicates removed.<br />
<br />
A few examples:<br />
<br />
<text fill="black" font-family="Arial" font-size="15px" style="background-color: white; color: #444444; font-family: Arial;" x="120" y="20">[0, 0, 0] ==> </text><text fill="black" font-family="Arial" font-size="15px" style="background-color: white; color: #444444; font-family: Arial;" x="120" y="40">[[0, 0, 0]]</text><br />
<text fill="black" font-family="Arial" font-size="15px" style="background-color: white; color: #444444; font-family: Arial;" x="120" y="70">[-1, 0, 1, 2, -1, -4] ==> </text><text fill="black" font-family="Arial" font-size="15px" style="background-color: white; color: #444444; font-family: Arial;" x="120" y="90">[[-1, -1, 2], [-1, 0, 1]]</text><br />
<text fill="black" font-family="Arial" font-size="15px" style="background-color: white; color: #444444; font-family: Arial;" x="120" y="120">[-2, 0, -2, 1, 0, 4, 0, -1, -2, 0, -2, 1, 0, 4, 0, -1] ==> </text><text fill="black" font-family="Arial" font-size="15px" style="background-color: white; color: #444444; font-family: Arial;" x="120" y="140">[[-1, 0, 1], [-2, -2, 4], [0, 0, 0], [-2, 1, 1]]</text><br />
<br />
The naive, brute force, yet Pythonic approach would be to use <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">itertools.combinations</span> to generate a collection of all possible triplets. From those triplets, we filter out the ones that sum up to zero. Each triplet we then order and add to a set to remove the duplicates. We need to temporarily convert the triples to a tuple, as lists are not hashable in Python. Finally, we convert the set of unique results into a list of lists, as requested by Leetcode:</div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">
</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">threesum_brute_force</span>(L):
triples <span style="color: #333333;">=</span> itertools<span style="color: #333333;">.</span>combinations(L, <span style="color: #0000dd; font-weight: bold;">3</span>) <span style="color: #888888;"># O(n^3)</span></pre>
<pre style="line-height: 125%; margin: 0;"> zeroes = [t <span style="color: #008800; font-weight: bold;">for</span> t in triples <span style="color: #008800; font-weight: bold;">if</span> <span style="color: #007020;">sum</span>(t) <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0]</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">map</span>(<span style="color: #007020;">list</span>, <span style="color: #007020;">set</span>([<span style="color: #007020;">tuple</span>(<span style="color: #007020;">sorted</span>(t)) <span style="color: #008800; font-weight: bold;">for</span> t <span style="color: black; font-weight: bold;">in</span> zeroes]))
</pre>
</div>
<br />
That approach is O(n^3) because we compare all possible combinations. It may look like we actually create them ahead of time before doing any further analysis, but <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">itertools.combinations</span> is implemented as a generator function, so we only produce one triplet at a time, when needed. For a list with 800 elements, we end up generating and comparing 510,081,600 triples.<br />
<br />
A similar implementation uses three loops to create each of the combinations. This is still O(n^3), more clearly showing now as we iterate over i, j, and k. This version runs much faster than the previous one, as we avoid creating all the intermediate tuples themselves. We are still doing a lot of repetitive work.<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">
</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">threesum_triple_loop</span>(L):
result <span style="color: #333333;">=</span> <span style="color: #007020;">set</span>()
<span style="color: #008800; font-weight: bold;">for</span> i <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">xrange</span>(<span style="color: #007020;">len</span>(L) <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">2</span>):
<span style="color: #008800; font-weight: bold;">for</span> j <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">xrange</span>(i <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #007020;">len</span>(L) <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>):
<span style="color: #008800; font-weight: bold;">for</span> k <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">xrange</span>(j <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #007020;">len</span>(L)): <span style="color: #888888;"># O(n^3)</span>
<span style="color: #008800; font-weight: bold;">if</span> L[i] <span style="color: #333333;">+</span> L[j] <span style="color: #333333;">+</span> L[k] <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span>:
result<span style="color: #333333;">.</span>add(<span style="color: #007020;">tuple</span>(<span style="color: #007020;">sorted</span>((L[i], L[j], L[k]))))
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">map</span>(<span style="color: #007020;">list</span>, result)
</pre>
</div>
<br />
Rather than comparing all triples, producing O(n^3) time complexity, there is a way to solve this problem in O(n^2). The insight is to first sort all the numbers. Then we loop from left to right. At each incremental step the next number will be larger than before. Rather than have two nested loops to find the next two elements of the triple, we maintain a region bound by <span style="font-family: "courier new" , "courier" , monospace;">j</span> and <span style="font-family: "courier new" , "courier" , monospace;">k</span>.<br />
<br />
While we go over each number, we sum this number with the two numbers on the edge of the region. If the sum is greater than zero, we are too far to the right and we shrink the region towards the left. If the sum is less than zero, we are too far to the left. The region remembers the last result and provides a good starting point to find the next zero sum triple. This makes this solution a <a href="https://en.wikipedia.org/wiki/Dynamic_programming" target="_blank">Dynamic Programming</a> solution, rather than brute force.<br />
<br />
A final optimization is to stop once <span style="font-family: "courier new" , "courier" , monospace;">i</span> reaches zero. Namely, after than point the total sum can only be larger than zero, so we can stop iterating.<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">
</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">threesum</span>(L):
L <span style="color: #333333;">=</span> <span style="color: #007020;">sorted</span>(L)
result <span style="color: #333333;">=</span> <span style="color: #007020;">set</span>()
<span style="color: #008800; font-weight: bold;">for</span> i <span style="color: black; font-weight: bold;">in</span> <span style="color: #007020;">xrange</span>(<span style="color: #007020;">len</span>(L) <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">2</span>):
j <span style="color: #333333;">=</span> i <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>
k <span style="color: #333333;">=</span> <span style="color: #007020;">len</span>(L) <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">while</span> j<span style="color: #333333;"><</span>k:
s <span style="color: #333333;">=</span> L[i] <span style="color: #333333;">+</span> L[j] <span style="color: #333333;">+</span> L[k]
<span style="color: #008800; font-weight: bold;">if</span> s <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span>:
result<span style="color: #333333;">.</span>add(<span style="color: #007020;">tuple</span>(<span style="color: #007020;">sorted</span>((L[i], L[j], L[k]))))
<span style="color: #008800; font-weight: bold;">if</span> s <span style="color: #333333;">></span> <span style="color: #0000dd; font-weight: bold;">0</span>:
k <span style="color: #333333;">-=</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">else</span>:
j <span style="color: #333333;">+=</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">if</span> L[i] ><span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>:
<span style="color: #008800; font-weight: bold;">break</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">map</span>(<span style="color: #007020;">list</span>, result)
</pre>
</div>
<br />
Using DP, we narrowed down the search space dramatically. However, in the above solution, each time we increment <span style="font-family: "courier new" , "courier" , monospace;">i</span>, we pick the next <span style="font-family: "courier new" , "courier" , monospace;">k</span> by selecting the last number in the list, which is also the maximum number. Now, once <span style="font-family: "courier new" , "courier" , monospace;">i</span> gets closer and closer to zero, this <span style="font-family: "courier new" , "courier" , monospace;">k</span> will be less appropriate and we will end up doing a linear search from <span style="font-family: "courier new" , "courier" , monospace;">k</span> towards <span style="font-family: "courier new" , "courier" , monospace;">i</span> to get a smaller positive number that will give us a zero sum. The optimal <span style="font-family: "courier new" , "courier" , monospace;">k</span> can be found more efficiently with a binary search, which makes the overall algorithm more efficient again.<br />
<br />
A similar argument applies to <span style="font-family: "courier new" , "courier" , monospace;">j</span>. If we consider the maximum number, it may make less sense to make <span style="font-family: "courier new" , "courier" , monospace;">j</span> become <span style="font-family: "courier new" , "courier" , monospace;">i+1</span> for its first candidate. Namely if <span style="font-family: "courier new" , "courier" , monospace;">i+j+k==0</span>, then <span style="font-family: "courier new" , "courier" , monospace;">j</span> should be <span style="font-family: "courier new" , "courier" , monospace;">-i + -k</span>. For example, assume the number at <span style="font-family: "courier new" , "courier" , monospace;">i</span> is <span style="font-family: "courier new" , "courier" , monospace;">-30</span> and the maximum for the list is <span style="font-family: "courier new" , "courier" , monospace;">24</span>. In that case, rather than make <span style="font-family: "courier new" , "courier" , monospace;">j</span> point at a number such as <span style="font-family: "courier new" , "courier" , monospace;">-29</span>, we can make <span style="font-family: "courier new" , "courier" , monospace;">j</span> skip all the way ahead to <span style="font-family: "courier new" , "courier" , monospace;">6</span>, as that would be the first candidate to yield zero. Again, using binary search will give us a starting point for <span style="font-family: "courier new" , "courier" , monospace;">j</span> more effectively than doing a linear search, if we are dealing with a large number of elements:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><pre style="color: #333333; line-height: 16.25px;"></pre>
<pre style="color: #333333; line-height: 16.25px;"><span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">threesum_binarysearch</span>(L):
L = <span style="color: #007020;">sorted</span>(L)
n = <span style="color: #007020;">len</span>(L)
result = <span style="color: #007020;">set</span>()
i = <span style="color: #0000dd; font-weight: bold;">0</span>
<span style="color: #008800; font-weight: bold;">while</span> i < n - <span style="color: #0000dd; font-weight: bold;">2</span>:
j = binarySearch(L, i + <span style="color: #0000dd; font-weight: bold;">1</span>, n - <span style="color: #0000dd; font-weight: bold;">2</span>, -(L[i] + L[-<span style="color: #0000dd; font-weight: bold;">1</span>]))
k = binarySearch(L, j + <span style="color: #0000dd; font-weight: bold;">1</span>, n - <span style="color: #0000dd; font-weight: bold;">1</span>, -(L[i] + L[j]))
<span style="color: #008800; font-weight: bold;">while</span> j<k <span style="color: black; font-weight: bold;">and</span> k<n:
s = L[i] + L[j] + L[k]
<span style="color: #008800; font-weight: bold;">if</span> s == <span style="color: #0000dd; font-weight: bold;">0</span>:
result.add(<span style="color: #007020;">tuple</span>(<span style="color: #007020;">sorted</span>((L[i], L[j], L[k]))))
k -= <span style="color: #0000dd; font-weight: bold;">1</span>
j += <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">elif</span> s > <span style="color: #0000dd; font-weight: bold;">0</span>:
k -= <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">else</span>:
j += <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">if</span> L[i] == <span style="color: #0000dd; font-weight: bold;">0</span>:
<span style="color: #008800; font-weight: bold;">break</span>
i += <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">map</span>(<span style="color: #007020;">list</span>, result)</pre>
</pre>
</div>
<br />
The binary search makes the search for <span style="font-family: "courier new" , "courier" , monospace;">j</span> and <span style="font-family: "courier new" , "courier" , monospace;">k</span> possible in O(log n), rather than O(n):<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">
</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: #008800; font-weight: bold;">def</span> <span style="color: #0066bb; font-weight: bold;">binarySearch</span>(L, <span style="color: #007020;">min</span>, <span style="color: #007020;">max</span>, target):
<span style="color: #008800; font-weight: bold;">while</span> <span style="color: #007020;">True</span>:
<span style="color: #008800; font-weight: bold;">if</span> <span style="color: #007020;">max</span> <span style="color: #333333;"><</span> <span style="color: #007020;">min</span>:
<span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">min</span>
mid <span style="color: #333333;">=</span> (<span style="color: #007020;">min</span> <span style="color: #333333;">+</span> <span style="color: #007020;">max</span>) <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>
<span style="color: #008800; font-weight: bold;">if</span> L[mid] <span style="color: #333333;"><</span> target:
<span style="color: #007020;">min</span> <span style="color: #333333;">=</span> mid <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">elif</span> L[mid] <span style="color: #333333;">></span> target:
<span style="color: #007020;">max</span> <span style="color: #333333;">=</span> mid <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>
<span style="color: #008800; font-weight: bold;">else</span>:
<span style="color: #008800; font-weight: bold;">return</span> mid
</pre>
</div>
<br />
To give you an idea of the performance of each algorithm, here is a test run with 800 numbers, showing the number of <span style="font-family: "courier new", courier, monospace;">i+j+k==0</span> comparisons performed by each algorithm and the total time needed:<br />
<style type="text/css"><!--td {border: 1px solid #ccc;}br {mso-data-placement:same-cell;}--></style><br />
<table border="1" cellpadding="0" cellspacing="0" dir="ltr" style="border-collapse: collapse; border: 1px solid #ccc; font-family: arial,sans,sans-serif; font-size: 13px; table-layout: fixed;"><colgroup><col width="126"></col><col width="117"></col><col width="117"></col></colgroup><tbody>
<tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Algorithm"}" style="background-color: #434343; color: white; padding: 2px 3px 2px 3px; vertical-align: bottom;">Algorithm</td><td data-sheets-value="{"1":2,"2":"Comparisons"}" style="background-color: #434343; color: white; padding: 2px 3px 2px 3px; vertical-align: bottom;">Comparisons</td><td data-sheets-value="{"1":2,"2":"Time in seconds"}" style="background-color: #434343; color: white; padding: 2px 3px 2px 3px; vertical-align: bottom;">Time (s)</td></tr>
<tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Brute force"}" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">Brute force</td><td data-sheets-value="{"1":3,"3":0.108999967575}" style="padding: 2px 3px; vertical-align: bottom;">510,081,600 </td><td data-sheets-value="{"1":3,"3":21.0980000496}" style="padding: 2px 3px; vertical-align: bottom;">21.09800005</td></tr>
<tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"Three loops"}" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">Three loops</td><td data-sheets-value="{"1":3,"3":0.108999967575}" style="padding: 2px 3px; vertical-align: bottom;">636,804</td><td data-sheets-value="{"1":3,"3":0.108999967575}" style="padding: 2px 3px; vertical-align: bottom;">0.1089999676</td></tr>
<tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"DP"}" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">DP</td><td data-sheets-value="{"1":3,"3":0.108999967575}" style="padding: 2px 3px; vertical-align: bottom;">195,048</td><td data-sheets-value="{"1":3,"3":0.0710000991821}" style="padding: 2px 3px; vertical-align: bottom;">0.07100009918</td></tr>
<tr style="height: 21px;"><td data-sheets-value="{"1":2,"2":"DP + binary search"}" style="padding: 2px 3px 2px 3px; vertical-align: bottom;">DP + binary search</td><td data-sheets-value="{"1":3,"3":0.108999967575}" style="padding: 2px 3px; vertical-align: bottom;">133,383</td><td data-sheets-value="{"1":3,"3":0.0360000133514}" style="padding: 2px 3px; vertical-align: bottom;">0.03600001335</td></tr>
</tbody></table>
<br />
In short, when solving problems like 3Sum, avoid comparing all elements. This can be done by reducing the search space and ignoring candidates that won't lead to a solution anyways. Aside from reducing the search space, if we can order it in some form, we have even more optimization opportunities. In that case, we can typically avoid linear search and use binary search.<br />
<br />
Check out many more algorithms with visualizations at <a href="http://pyalgoviz.appspot.com/">PyAlgoViz</a>.<br />
<br />
<hr />
Python code styled as <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">tango</span> by <a href="http://hilite.me/">hilite.me</a> with <span style="background-color: white;"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">border:1px solid #ddd;padding:11px 7px;</span></span>Chris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.com0tag:blogger.com,1999:blog-7152855345155101431.post-58687953682808252782016-12-27T12:40:00.002-08:002016-12-27T13:13:05.664-08:00Big NumbersThe human brain is ill-equipped to deal with large numbers. Drawing attention to large numbers is only partially effective. For instance, the following ticker in NYC displays the current US National Debt:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcmC_sXMJCT3dM4MlUVGegzT5ifI3skt0k5p14jzF6H6Mj-oSWTGuL6D_tUx_OHDggpMEqL4hjW_4lAS90oUq6k1X7cviZxiIbMdBoVmh_fWQgE4zWGUCaxtEW2XWCsGS_GX9OpwhyphenhyphenQ_M/s1600/debt.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgcmC_sXMJCT3dM4MlUVGegzT5ifI3skt0k5p14jzF6H6Mj-oSWTGuL6D_tUx_OHDggpMEqL4hjW_4lAS90oUq6k1X7cviZxiIbMdBoVmh_fWQgE4zWGUCaxtEW2XWCsGS_GX9OpwhyphenhyphenQ_M/s400/debt.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">The US National Debt clock at Union Square, NYC</td></tr>
</tbody></table>
<br />
That number is so big it becomes abstract and meaningless. In fact, the above picture is already from a while back. By the time you read this, the actual US National Debt is closer to $20 trillion. A number so large it becomes hard to parse, especially as the image above is missing commas.<br />
<br />
Maybe it helps if we write down the current amount, as of writing of this text, in words and compare it with another large country:<br />
<br />
<table border="1" cellpadding="5" style="border-collapse: collapse;">
<tbody>
<tr><th width="50">Country</th><th width="140">Debt in Dollars</th><th>Debt in Words/Speech</th></tr>
<tr><td>USA</td><td>$19,859,586,951,270</td><td><a href="http://chrislaffra.com/19trillion.mp3" target="_blank">Nineteen trillion, eight hundred fifty-nine billion, five hundred eighty-six million, nine hundred fifty-one thousand, two hundred seventy dollars.</a></td></tr>
<tr><td>Russia</td><td>$152,205,694,374</td><td><a href="http://chrislaffra.com/152billion.mp3" target="_blank">One hundred fifty-two billion, two hundred five million, six hundred ninety-four thousand, three hundred seventy-four dollars.</a></td></tr>
</tbody></table>
<br />
If we actually do round off the US debt to $20 billion, the Russian debt is actually the rounding error. That's how large the US number really is.<br />
<br />
<br />
<span style="font-size: large;"><b>Payback Time</b></span><br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<br />
To get a feeling for the amount, what if <i>you</i> would spend 100% of your current net income to pay off the debt, how many years would it take you to pay off the US National Debt? For the average US citizen, it would take 525 million working years. If we combine the entire US workforce at roughly 155 million people and spend <i>all</i> their income on relieving the US debt, it would still take an entire Trump presidency to pay off the debt. This is not realistic for a few reasons, of course.<br />
<br />
Even if each person in the US workforce would spend a modest 10% of their net income on paying off the national debt, it would take the entire country 34 years to pay off the current debt amount, assuming the US government was able to not grow the debt even larger.<br />
<br />
Mind-boggling numbers.<br />
<br />
<br />
<span style="font-size: large;"><b>Visualizing Large Amounts of Money</b></span><br />
<br />
Visualizing the debt in the shape of $100 bills may help. Here is the US National Debt in 2013, compared to an American Football field and Miss Liberty:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibgOd64qe0Zth81_HQpPCwz_dM1fvBi-oeAyYygX1aKI2hp7lWONVEZhDheCxJqAWrwa4kSYojh3DeUHfIuJebkxPMn3ib72hGZQRnPut2HFm7TkJ-PYanbMqDV7PTdn4Q0IsoZe9Pvgg/s1600/debt-dollars.jpg" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="300" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEibgOd64qe0Zth81_HQpPCwz_dM1fvBi-oeAyYygX1aKI2hp7lWONVEZhDheCxJqAWrwa4kSYojh3DeUHfIuJebkxPMn3ib72hGZQRnPut2HFm7TkJ-PYanbMqDV7PTdn4Q0IsoZe9Pvgg/s400/debt-dollars.jpg" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">US National Debt visualized in stacks of $100 bills (<a href="http://www.businessinsider.com/how-can-you-even-visualize-what-the-us-national-debt-would-look-like-stacked-in-100-bills-2011-7">link</a>)</td></tr>
</tbody></table>
<br />
For reference, this is what one billion dollars looks like in stacks of $100 bills:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUgG7SVREKKx7k0fN58h1ht353LjDtjihCvQKvjq6qOJhzcGsrfhws2hscbfb0K5oiJ5oziw4aBdWxlZq3CPhBEK0P5ZYnxAXma-S22hjTbDASjQMUvQYiHBRie6em_6GyzDUWiFd6ciU/s1600/usd-1_billion_dollars-1%252C000%252C000%252C000_USD-v2.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="310" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUgG7SVREKKx7k0fN58h1ht353LjDtjihCvQKvjq6qOJhzcGsrfhws2hscbfb0K5oiJ5oziw4aBdWxlZq3CPhBEK0P5ZYnxAXma-S22hjTbDASjQMUvQYiHBRie6em_6GyzDUWiFd6ciU/s400/usd-1_billion_dollars-1%252C000%252C000%252C000_USD-v2.jpg" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
It is rumored that <a href="http://www.barstoolsports.com/iowa/at-his-peak-pablo-escobar-was-losing-2-1-billion-a-year-due-to-rats-eating-money-in-storage/">Pablo Escobar</a> lost $2.3 billion each year due to rats eating the bills. That's 2.3X the amount showing above.<br />
<br />
Pictures help.
Check out <a href="http://demonocracy.info/">demonocracy.info</a> for many other visualizations of large amounts of money.<br />
<div>
<br /></div>
<br />
<span style="font-size: large;"><b>Short Scale and Large Scale</b></span><br />
<br />
The large numbers we talked about so far are huge. But, we cannot even agree on what to call really large numbers. Two different scales to talk about large numbers are in use:<br />
<ul>
<li><b>Long scale.</b> Every new term in the scale is one million times larger than the previous. A billion means a million millions (10^12), a trillion means a million billions (10^18), etc.</li>
<li><b>Short scale.</b> Every new term is one thousand times larger than the previous. A billion means a thousand millions (10^9), trillion means a thousand billions (10^12), etc.</li>
</ul>
For fun, here is what scale is being adopted by different countries across the world, including a couple of exceptions to the rule:<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPhwMdZX8YSjhSdnI0GXUXIgppjRzBt88sG9DODj2d07P0_20laR-Kq-DZFBPVyViER34RSTviTT9dmSDmRhqysYaFlpncYmPbqlm23GgwOB6tGenJwkYc0Ps39l2Kh4fqXfzohKigH3o/s1600/Screen+Shot+2016-12-27+at+12.44.50+PM.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="251" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiPhwMdZX8YSjhSdnI0GXUXIgppjRzBt88sG9DODj2d07P0_20laR-Kq-DZFBPVyViER34RSTviTT9dmSDmRhqysYaFlpncYmPbqlm23GgwOB6tGenJwkYc0Ps39l2Kh4fqXfzohKigH3o/s400/Screen+Shot+2016-12-27+at+12.44.50+PM.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Use of short and long scale across the world (<a href="https://en.wikipedia.org/wiki/Long_and_short_scales" style="font-size: 12.8px;">wikipedia</a><span style="font-size: 12.8px;">)</span></td></tr>
</tbody></table>
<br />
If you care, a milliardaire and a billionaire are equally rich. Confusing.<br />
<br />
<br />
<b><span style="font-size: large;">Algorithms and Big Numbers</span></b><br />
<br />
Large numbers confound us. Reasoning about large numbers is even harder. So how can we make the analysis of algorithms more insightful? Applying scales helps. Writing out large numbers in words may help. Visualization definitely helps.<br />
<br />
Say we have a certain algorithm that we run on a modest number of 50 elements. The algorithm has implementations with different time complexity. How efficient is each algorithm? How well does it scale when we increase the number of elements? Those are questions we care about when scaling to thousands, millions, or even billions of users.<br />
<br />
For each class of complexity, the required operations for 50 elements are at the following scale:<br />
<br />
<table border="1" cellpadding="5" style="border-collapse: collapse;">
<tbody>
<tr><th width="80">Complexity</th><th width="85">Operations</th><th>Number of Operations in Words/Speech</th></tr>
<tr><td>O(1)</td><td>1</td><td>One</td></tr>
<tr><td>O(log n)</td><td>4</td><td>Four</td></tr>
<tr><td>O(n)</td><td>50</td><td>Fifty</td></tr>
<tr><td>O(n log n)</td><td>195</td><td>One hundred ninety-five</td></tr>
<tr><td>O(n^2)</td><td>2 500</td><td>Two thousand, five hundred</td></tr>
<tr><td>O(2^n)</td><td>1 125 899 906 842 624</td><td><a href="http://chrislaffra.com/1quadrillion.mp3" target="_blank">One quadrillion, one hundred twenty-five trillion, eight hundred ninety-nine billion, nine hundred six million, eight hundred forty-two thousand, six hundred twenty-four</a></td></tr>
<tr><td>O(factorial)</td><td>30 414 093 201 713 378 043 612 608 166 064 768 844 377 641 568 960 512 000 000 000 000</td><td><a href="http://chrislaffra.com/30vigintillion.mp3" target="_blank">Thirty vigintillion, four hundred fourteen novemdecillion, ninety-three octodecillion, two hundred one septendecillion, seven hundred thirteen sexdecillion, three hundred seventy-eight quindecillion, forty-three quattuordecillion, six hundred twelve tredecillion, six hundred eight duodecillion, one hundred sixty-six undecillion, sixty-four decillion, seven hundred sixty-eight nonillion, eight hundred forty-four octillion, three hundred seventy-seven septillion, six hundred forty-one sextillion, five hundred sixty-eight quintillion, nine hundred sixty quadrillion, five hundred twelve trillion</a></td></tr>
</tbody></table>
<br />
Those numbers may still not mean much. But, just read out the value for <b>O(2^n)</b>. Does sound silly, right? The number for <b>O(factorial)</b> actually sounds laughable. This is what humans do with large numbers. Our brain does not know what to do with them, gives up, and turns it into a joke. BTW. Who made up those names? Vigintillion? Novemdecilion? Undecillion?<br />
<br />
Over their training and/or career, most software engineers have developed an intuitive feeling for complexity theory. They know that in interview questions, proposing an implementation that is <b>O(n^2)</b> means you fell for the trap set by the interviewer. There must be a way to use a hashmap somewhere to get the cost down to <b>O(n)</b> or even <b>O(1)</b>.<br />
<br />
But does the candidate really understand why <b>O(n^2)</b> is so bad? How do <b>O(n)</b> and <b>O(n^2) </b>relate to each other as the number of elements grows? Let's try a visualization (click the play icon to replay):<br />
<br />
<div style="height: 407px; overflow: hidden; width: 600px;">
<iframe height="425" scrolling="no" src="https://pyalgoviz.appspot.com/show?visualize=true&name=Big-O%20Notation" style="margin: -12px 0 0 -5px; transform-origin: 0 0; transform: scale(0.9);" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<span style="font-size: medium;">In the interactive chart above, the number of elements are between 1 and 50 and are plotted from left to right. The vertical scale shows the number of operations, capped at 800. <b>O(log n)</b>, <b>O(n)</b>, and even <b>O(n log n)</b> stay linear for a long time. The big P refers to algorithms that can be solved in <a href="https://en.wikipedia.org/wiki/P_versus_NP_problem">polynomial time,</a> while NP refers to nondeterministic polynomial time. </span><br />
<span style="font-size: medium;"><br /></span>
<span style="font-size: medium;">When interviewing, you definitely do not want to be nondeterministic. Already for n=10, the purple line for <b>O(n^2)</b> is off the scale in the above chart. Search for solutions that are <b>O(n)</b>, as is the case in linear search, or <b>O(log n)</b>, the number of steps needed in binary search, or ideally <b>O(1)</b>, when using a hashtable.</span><br />
<span style="font-size: medium;"><br /></span>
<span style="font-size: medium;"><br /></span>
<b><span style="font-size: large;">Links</span></b><br />
<br />
<ul>
<li>A large number of live tickers, including the current US National Debt can be found at <a href="http://usdebtclock.org/">usdebtclock.org</a>. </li>
<li>Pronouncing numbers as words makes for a nice programming interview question, but you can also go to <a href="http://www.calculatorsoup.com/calculators/conversions/numberstowords.php">Calculator Soup</a>. </li>
<li>Check out <a href="http://demonocracy.info/">demonocracy.info</a> for many other visualizations of large amounts of money.</li>
<li>For more details on program analysis, check out <a href="https://en.wikipedia.org/wiki/Time_complexity">time complexity</a>.</li>
<li>The O(x) visualization can be found amid numerous algorithm visualizations at <a href="http://pyalgoviz.appspot.com/">PyAlgoViz</a>.</li>
</ul>
<br />
<br />Chris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.com1tag:blogger.com,1999:blog-7152855345155101431.post-73487725791552474512016-12-19T09:15:00.002-08:002017-12-30T05:36:24.534-08:00Auger - Automatic Unit Test Generation for PythonUnit tests are crucial to software development. They verify whether a given component actually implements what it promises. They are also important for long-living code, where future maintenance can be much more expensive without the proper level of unit test coverage. Not surprisingly, the lack of unit tests is a show-stopper for submitting production code at many companies.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyr417GabK5XBItkQ3H3JzX26LziTawc31aCIcw_BeWcOnSy5nZ6J0LUIs-hW-40jsllJfXLhyphenhyphenGeJMO_unaF_e-POsp3nBX4bcdqEJVe9K-JNs55WtBad2EQJcXWpSQZB3tbu5bqdvq18/s1600/write+unit+test.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="258" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiyr417GabK5XBItkQ3H3JzX26LziTawc31aCIcw_BeWcOnSy5nZ6J0LUIs-hW-40jsllJfXLhyphenhyphenGeJMO_unaF_e-POsp3nBX4bcdqEJVe9K-JNs55WtBad2EQJcXWpSQZB3tbu5bqdvq18/s400/write+unit+test.jpg" width="400" /></a></div>
<br />
You may wonder, if unit tests are so great, why do engineers hate to write them? More than once have I heard fellow engineers say "If you approve my code now, I will write the unit tests in a future change list." All engineers I know actually like writing code. It is creative. Code means impact. However, in general, engineers do not like writing unit tests at all. Why is that?<br />
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">Why is Unit Testing Hard?</span></b><br />
<br />
Unit testing is hard for various reasons:<br />
<ol>
<li>One reason would be that a unit testing framework is exactly that. A framework. When I worked on the IBM J9 and Eclipse team with <a href="http://www.davethomas.net/" target="_blank">Dave Thomas</a>, he used to say "Everybody likes to <i>write</i> frameworks, but nobody likes to actually <i>use</i> them". In practice, unit test frameworks add yet another complexity to learn and master, especially when unit test frameworks have little in common across programming languages or organizations.</li>
<li>All unit testing frameworks start off modestly. <a href="https://en.wikipedia.org/wiki/Erich_Gamma" target="_blank">Erich Gamma</a> once confided in me how he and Kent Beck wrote the original version of <a href="https://www.google.com/search?q=junit" target="_blank">JUnit</a> when they got bored on a transcontinental flight. He added how those few hours of work had been by far the best investment of his technical career ever. Today, however, unit testing frameworks are far from trivial and require a considerable learning investment in understanding their power and intricacies.</li>
<li>Software systems themselves are also becoming increasingly complex. Even seemingly standalone components are heavily embedded in a context of complex runtimes. To isolate those dependencies and sculpt them with the proper "mocks" is an art. It is not uncommen to spend hours trying to find out how to write the mocks for one line of test code. Dependency injection and fancy mock syntax conspire to make the life of unit test authors challenging.</li>
<li>Code under development constantly changes. Often, coding can be a discovery game. A design doc can outline architectural decisions on what database to use and how the UI is created. However, it tends to leave the actual coding as an exercise to the reader. During coding, discoveries are made. Code is constantly refactored. Code is found obsolete and is deleted even before it is ever committed to version control. This is particularly the case when a new domain or technology is being explored. We still write unit tests, but <a href="http://paul-m-jones.com/archives/275" target="_blank">we write them later</a>, when the dust settles.</li>
</ol>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYIDfaR6Lv3wRelDMnO1ZtE2q0Ax2c4y4Zk21QRc4nozT9zOu-s9nAt52C16961_2JNOVL7_SHNPkcr8HJdkwTNetrNNRhB4JvIcGWLPaNwxa3X4L3QMdgp-SV6-AN42Mz_afst4UWXrw/s1600/awful.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYIDfaR6Lv3wRelDMnO1ZtE2q0Ax2c4y4Zk21QRc4nozT9zOu-s9nAt52C16961_2JNOVL7_SHNPkcr8HJdkwTNetrNNRhB4JvIcGWLPaNwxa3X4L3QMdgp-SV6-AN42Mz_afst4UWXrw/s320/awful.png" width="260" /></a></div>
<br />
<b><span style="font-size: large;">
OK. Unit tests are hard. But, what is the alternative?</span></b><br />
<br />
Let's agree, unit test are a chore and tough to write. This does not mean engineers are not testing their code, even if they do not write unit tests. Many engineers, like myself, write code in a highly iterative fashion. For instance, say I am writing an AppEngine app to return a web page. I would first add a route, write a Handler, and simply return "Hello World". To test, I would spin up a local instance, point my browser at localhost:8080 and see if it shows the string I expected. I iterate that step hundreds of times.<br />
<br />
In this iterative development mode, small, incremental steps are made towards the end goal and progress is constantly validated. Each time, some more functionality is added and tested on the code in progress. At some point, we are going to be "done". At his final point in time, hundreds or thousands of "test" runs have been exercised on the code under development. Each component has had some inputs and produced some expected outputs. If only we could remember what those inputs and outputs were and our tests would be so much easier to write. Cue: <a href="https://github.com/laffra/auger" target="_blank">Project Auger</a>.<br />
<br />
<b><span style="font-size: large;">
What is Project Auger?</span></b><br />
<b><span style="font-size: large;"><br /></span></b>
<a href="https://github.com/laffra/auger" target="_blank">Project Auger</a> (Automated Unittest Generator) watches your Python code while you write it and automatically generates all unit tests for your code, including all the mocks. Little or no work is required by the developer.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXj_S9-X7TNJmpCe0lUTa5rNiBPf-4xntz5MV24ww6HcIJ2trrJgOpInXIUpk502MGlascT7fp2x2sEAUoBtUPFfNS_Sue5ILQTv5W_BQASkfKiKx3_mbCFZL283zcu8iOGFjcAlO19BE/s1600/Shut-up-and-take-my-money.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="250" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXj_S9-X7TNJmpCe0lUTa5rNiBPf-4xntz5MV24ww6HcIJ2trrJgOpInXIUpk502MGlascT7fp2x2sEAUoBtUPFfNS_Sue5ILQTv5W_BQASkfKiKx3_mbCFZL283zcu8iOGFjcAlO19BE/s400/Shut-up-and-take-my-money.jpg" width="400" /></a></div>
<br />
<br />
<b><span style="font-size: large;">
How does Auger Work?</span></b><br />
<br />
Auger works like a smart Python debugger that sets breakpoints for each component you are interested in. Auger tracks two kinds of function calls related to the module under test:<br />
<br />
<ul>
<li>For each function defined in the module, Auger records both the values of the arguments and the returned results. After recording enough execution traces, unit tests can be generated with the meaningful placeholder argument values and assertions. </li>
<li>For each call made <i>from</i> a given component to dependent libraries or other components, we record the return value, so that this call can be automatically mocked out with known return values.</li>
</ul>
<br />
<span style="font-family: inherit;">Auger tracks all possible functions, including instance methods, class methods, and static functions.</span><br />
<span style="font-family: inherit;"><br /></span>
Consider the following example, <a href="https://github.com/laffra/auger/blob/master/sample/pet.py">pet.py</a>, that provides a <span style="font-family: "courier new" , "courier" , monospace;">Pet</span> with a name, age, and a species:<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">from</span> <span style="color: black;">sample.animal</span> <span style="color: #204a87; font-weight: bold;">import</span> <span style="color: black;">Animal</span>
<span style="color: #204a87; font-weight: bold;">class</span> <span style="color: black;">Pet</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">Animal</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">__init__</span><span style="color: black; font-weight: bold;">(</span><span style="color: #3465a4;">self</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">name</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #ce5c00; font-weight: bold;">*</span><span style="color: black;">args</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: black;">Animal</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">__init__</span><span style="color: black; font-weight: bold;">(</span><span style="color: #3465a4;">self</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #ce5c00; font-weight: bold;">*</span><span style="color: black;">args</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #3465a4;">self</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">_name</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black;">name</span>
<span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">get_name</span><span style="color: black; font-weight: bold;">(</span><span style="color: #3465a4;">self</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">return</span> <span style="color: #3465a4;">self</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">_name</span>
<span style="color: #5c35cc; font-weight: bold;">@staticmethod</span>
<span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">lower</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">s</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">s</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">lower</span><span style="color: black; font-weight: bold;">()</span>
<span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">__str__</span><span style="color: black; font-weight: bold;">(</span><span style="color: #3465a4;">self</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">return</span> <span style="color: #4e9a06;">'%s is a %s aged %d'</span> <span style="color: #ce5c00; font-weight: bold;">%</span> <span style="color: black; font-weight: bold;">(</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: #3465a4;"> self</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">get_name</span><span style="color: black; font-weight: bold;">(),</span> </pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: black;"> Pet</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">lower</span><span style="color: black; font-weight: bold;">(</span><span style="color: #3465a4;">self</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">get_species</span><span style="color: black; font-weight: bold;">()),</span> <span style="color: #3465a4;">self</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">get_age</span><span style="color: black; font-weight: bold;">()</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: black; font-weight: bold;"> )</span>
<span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">create_pet</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">name</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">species</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">age</span><span style="color: #ce5c00; font-weight: bold;">=</span><span style="color: #0000cf; font-weight: bold;">0</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">Pet</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">name</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">species</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">age</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">if</span> <span style="color: black;">__name__</span> <span style="color: #ce5c00; font-weight: bold;">==</span> <span style="color: #4e9a06;">'__main__'</span><span style="color: black; font-weight: bold;">:</span>
<span style="color: #204a87; font-weight: bold;">print</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">Pet</span><span style="color: black; font-weight: bold;">(</span><span style="color: #4e9a06;">'Polly'</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #4e9a06;">'Parrot'</span><span style="color: black; font-weight: bold;">))</span>
<span style="color: #204a87; font-weight: bold;">print</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">create_pet</span><span style="color: black; font-weight: bold;">(</span><span style="color: #4e9a06;">'Clifford'</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #4e9a06;">'Dog'</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #0000cf; font-weight: bold;">32</span><span style="color: black; font-weight: bold;">))</span>
</pre>
</div>
<br />
This class has a few different entry points we would need to unit test:<br />
<ul>
<li>The class <span style="font-family: "courier new" , "courier" , monospace;">Pet</span> itself which has:</li>
<ul>
<li>a static method, <span style="font-family: "courier new" , "courier" , monospace;">lower</span></li>
<li>two instance methods, <span style="font-family: "courier new" , "courier" , monospace;">get_name</span> and <span style="font-family: "courier new" , "courier" , monospace;">__str__</span></li>
<li>a constructor, <span style="font-family: "courier new" , "courier" , monospace;">__init__</span><span style="font-family: inherit;">, which is really a very special instance method</span></li>
</ul>
<li>A static function that creates a <span style="font-family: "courier new" , "courier" , monospace;">Pet</span> and returns it</li>
</ul>
<div>
The <span style="font-family: "courier new" , "courier" , monospace;">Pet</span> class is a subclass of <span style="font-family: "courier new" , "courier" , monospace;">Animal</span>, which we know nothing about, so we will need to mock that entire class. We do know that the class is used in the <span style="font-family: "courier new" , "courier" , monospace;">Pet</span> constructor and inherited methods are called from <span style="font-family: "courier new" , "courier" , monospace;">Pet</span> as well. This means that the implementation for <span style="font-family: "courier new" , "courier" , monospace;">self.get_species()</span> and <span style="font-family: "courier new" , "courier" , monospace;">self.get_age()</span> are unknown, as we cannot look at the implementation of <span style="font-family: "courier new" , "courier" , monospace;">Animal</span>, when unit testing <span style="font-family: "courier new" , "courier" , monospace;">Pet</span>. Therefore, those two inherited methods will be mocked out.<br />
<br />
<b><span style="font-size: large;">
Unit Tests Generation with Auger</span></b></div>
<div>
<br /></div>
<div>
The above class definition combined with an execution run is enough for Auger to <i>automatically</i> create the following fully functional unit test:</div>
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">from</span> <span style="color: black;">mock</span> <span style="color: #204a87; font-weight: bold;">import</span> <span style="color: black;">patch</span>
<span style="color: #204a87; font-weight: bold;">from</span> <span style="color: black;">sample.animal</span> <span style="color: #204a87; font-weight: bold;">import</span> <span style="color: black;">Animal</span>
<span style="color: #204a87; font-weight: bold;">import</span> <span style="color: black;">sample.pet</span>
<span style="color: #204a87; font-weight: bold;">from</span> <span style="color: black;">sample.pet</span> <span style="color: #204a87; font-weight: bold;">import</span> <span style="color: black;">Pet</span>
<span style="color: #204a87; font-weight: bold;">import</span> <span style="color: black;">unittest</span>
<span style="color: #204a87; font-weight: bold;">class</span> <span style="color: black;">PetTest</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">unittest</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">TestCase</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #5c35cc; font-weight: bold;">@patch.object</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">Animal</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #4e9a06;">'get_species'</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #5c35cc; font-weight: bold;">@patch.object</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">Animal</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #4e9a06;">'get_age'</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">test___str__</span><span style="color: black; font-weight: bold;">(</span><span style="color: #3465a4;">self</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">mock_get_age</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">mock_get_species</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: black;">mock_get_age</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">return_value</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #0000cf; font-weight: bold;">12</span>
<span style="color: black;">mock_get_species</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">return_value</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #4e9a06;">'Dog'</span>
<span style="color: black;">pet_instance</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black;">Pet</span><span style="color: black; font-weight: bold;">(</span><span style="color: #4e9a06;">'Clifford'</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #4e9a06;">'Dog'</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #0000cf; font-weight: bold;">12</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #3465a4;">self</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">assertEquals</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">pet_instance</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">__str__</span><span style="color: black; font-weight: bold;">(),</span> <span style="color: #4e9a06;">'Clifford is a dog aged 12'</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">test_create_pet</span><span style="color: black; font-weight: bold;">(</span><span style="color: #3465a4;">self</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #3465a4;">self</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">assertIsInstance</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">sample</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">pet</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">create_pet</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">age</span><span style="color: #ce5c00; font-weight: bold;">=</span><span style="color: #0000cf; font-weight: bold;">12</span><span style="color: black; font-weight: bold;">,</span><span style="color: black;">species</span><span style="color: #ce5c00; font-weight: bold;">=</span><span style="color: #4e9a06;">'Dog'</span><span style="color: black; font-weight: bold;">,</span><span style="color: black;">name</span><span style="color: #ce5c00; font-weight: bold;">=</span><span style="color: #4e9a06;">'Clifford'</span><span style="color: black; font-weight: bold;">),</span> <span style="color: black;">Pet</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">test_get_name</span><span style="color: black; font-weight: bold;">(</span><span style="color: #3465a4;">self</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: black;">pet_instance</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black;">Pet</span><span style="color: black; font-weight: bold;">(</span><span style="color: #4e9a06;">'Clifford'</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #4e9a06;">'Dog'</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #0000cf; font-weight: bold;">12</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #3465a4;">self</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">assertEquals</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">pet_instance</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">get_name</span><span style="color: black; font-weight: bold;">(),</span> <span style="color: #4e9a06;">'Clifford'</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">test_lower</span><span style="color: black; font-weight: bold;">(</span><span style="color: #3465a4;">self</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #3465a4;">self</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">assertEquals</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">Pet</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">lower</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">s</span><span style="color: #ce5c00; font-weight: bold;">=</span><span style="color: #4e9a06;">'Dog'</span><span style="color: black; font-weight: bold;">),</span> <span style="color: #4e9a06;">'dog'</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">if</span> <span style="color: black;">__name__</span> <span style="color: #ce5c00; font-weight: bold;">==</span> <span style="color: #4e9a06;">"__main__"</span><span style="color: black; font-weight: bold;">:</span>
<span style="color: black;">unittest</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">main</span><span style="color: black; font-weight: bold;">()</span>
</pre>
</div>
<br />
<div>
No changes to the original code are needed to teach Auger anything. All that is required is for the developer to write their original code and exercise it somehow. In the above case, we simply ran <span style="font-family: "courier new" , "courier" , monospace;">python sample.pet</span> to produce two scenarios in which <span style="font-family: "courier new" , "courier" , monospace;">Pet</span> instances were created and manipulated. From those two sample, a single test was extracted.<br />
<br />
Of course Auger is limited in the sense that it cannot guess what <i>scenario</i> is being tested. Rather than generates multiple, focused tests per module, it will generate one big test that covers the entire module. The value of Auger is more in generating all the boiler plate code, imports and mocks, ensuring proper coverage, and to generate a template for manual refinement.<br />
<br />
To generate a set of unit tests, Auger magic is invoked:<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">import</span> <span style="color: black;">auger</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: black;">
</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: black;">... your code goes here ...</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: black;">
</span></pre>
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">if</span> <span style="color: black;">__name__</span> <span style="color: #ce5c00; font-weight: bold;">==</span> <span style="color: #4e9a06;">"__main__"</span><span style="color: black; font-weight: bold;">:</span>
<span style="color: #204a87; font-weight: bold;">with</span> <span style="color: black;">auger</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">magic</span><span style="color: black; font-weight: bold;">([</span><span style="color: black;">pet]</span><span style="color: black; font-weight: bold;">):</span>
... call the <span style="color: black;">main routine for your code ...<b> </b></span>
</pre>
</div>
<br />
In this case, one module is passed, <span style="font-family: "courier new" , "courier" , monospace;">pet</span>, but multiple modules can be passed as well. Each one will be traced and unit tested.<br />
<br />
When a unit test is produced, it is written out to the local file system under the corresponding <span style="font-family: "courier new" , "courier" , monospace;">tests</span> folder.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijMDKE9CKo5pa9dxA5NhKe7PLkfocz4JMCjplMZGWRRl9_upyunZjbhnGXL5u421NzDvqu8z6MP6rkHq8uP76WtjCCLhqrmWeu3Eh-cpnkcnk8jAxPyzLJUP3JPWr4g-eqCiEEwyBEguE/s1600/Screen+Shot+2016-12-19+at+11.51.59+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijMDKE9CKo5pa9dxA5NhKe7PLkfocz4JMCjplMZGWRRl9_upyunZjbhnGXL5u421NzDvqu8z6MP6rkHq8uP76WtjCCLhqrmWeu3Eh-cpnkcnk8jAxPyzLJUP3JPWr4g-eqCiEEwyBEguE/s400/Screen+Shot+2016-12-19+at+11.51.59+AM.png" width="305" /></a></div>
<h3>
</h3>
<h3>
</h3>
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">
IDE Integration</span></b><br />
<br />
Auger does not have direct IDE integration per se, but works really well with PyCharm. This integration comes for free, because the IDE watches the underlying file system and will automatically discover when new files are created by Auger in the local file system. These tests can then be executed easily as well:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdMOBrhvZbL86cHN0NCBb5P9onEQVEXM2SUJfB3juCJrBzAwWzoPdPzqx3V0wV1CYwVLalXgloMvdQhrJmPCmTcvpTeH95SzIuqgo2uzn7axGsMXvh0KYmrgW3YOgHxuV5RgaOdxHwBVc/s1600/Screen+Shot+2016-12-19+at+11.55.15+AM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="298" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgdMOBrhvZbL86cHN0NCBb5P9onEQVEXM2SUJfB3juCJrBzAwWzoPdPzqx3V0wV1CYwVLalXgloMvdQhrJmPCmTcvpTeH95SzIuqgo2uzn7axGsMXvh0KYmrgW3YOgHxuV5RgaOdxHwBVc/s400/Screen+Shot+2016-12-19+at+11.55.15+AM.png" width="400" /></a></div>
<br />
Adding the generated tests to Git and commit/push them to a repository means just a few clicks from that point onwards in an IDE such as PyCharm.<br />
<b><span style="font-size: large;"><br /></span></b>
<b><span style="font-size: large;">Future Work</span></b><br />
<br />
<ul>
<li>Incremental test generation. Collect multiple execution runs, persist the invocations, and merge multiple runs into one test case.</li>
<li>Preserve manual edits performed by users on generated test cases when a test is regenerated.</li>
<li>Support other unit test frameworks, such as pytest, nose, cucumber, etc.</li>
<li>Figure out how to run Auger on itself. This is non-trivial :-)</li>
</ul>
<br />
<br />
<br />
Check out <a href="https://github.com/laffra/auger" target="_blank">Project Auger</a> and let me know what you think of it. Pull requests are welcomed.<br />
<br />
<br /></div>
<br />
<hr />
Python code styled as <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">tango</span> by <a href="http://hilite.me/">hilite.me</a> with <span style="background-color: white;"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">border:1px solid #ddd;padding:11px 7px;</span></span>Chris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.com5tag:blogger.com,1999:blog-7152855345155101431.post-46824214288101710172016-12-18T10:36:00.003-08:002016-12-22T18:16:48.764-08:00QuickSort<a href="https://en.wikipedia.org/wiki/Quicksort" target="_blank">QuickSort</a> is a <a href="https://en.wikipedia.org/wiki/Divide_and_conquer_algorithms" target="_blank">divide-and-conquer</a> algorithm that splits up an array into two halves, and recursively sorts each half. In other words, to sort an array, we pick a random pivot, split the array in three section, being smaller, equal, or larger than the pivot, and recursively sort the array:<br />
<br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<table><tbody>
<tr><td><pre style="line-height: 125%; margin: 0;">1
2
3
4
5
6
7</pre>
</td><td><pre style="line-height: 125%; margin: 0;"> <span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">if</span> <span style="color: #204a87;">len</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;"><</span> <span style="color: #0000cf; font-weight: bold;">2</span><span style="color: black; font-weight: bold;">:</span> <span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">array</span>
<span style="color: black;">pivot</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">[</span><span style="color: #0000cf; font-weight: bold;">0</span><span style="color: black; font-weight: bold;">]</span>
<span style="color: black;">less</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;"><</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">equal</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">==</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">larger</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">></span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">less</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">equal</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">larger</span><span style="color: black; font-weight: bold;">)</span>
</pre>
</td></tr>
</tbody></table>
</div>
<br />
In the original implementation of QuickSort, the first element in the array was picked as pivot, as is done on line 3 above. Below, the red bar indicates the choice of pivot. At each step of the algorithm, it is at the left of the array being sorted:<br />
<br />
<div style="height: 147px; overflow: hidden; width: 600px;">
<iframe height="165" scrolling="no" src="https://pyalgoviz.appspot.com/show?visualize=true&name=Sorting%20-%20QuickSort%20-%201%20-%20First" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<br />
Choosing the first element as pivot can generate extra work when the array is already (partially) sorted. Rather than picking the first element, the middle element can be chosen, behaving better on already sorted arrays:<br />
<br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">if</span> <span style="color: #204a87;">len</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;"><</span> <span style="color: #0000cf; font-weight: bold;">2</span><span style="color: black; font-weight: bold;">:</span> <span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">array</span>
<span style="color: black;">pivot</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">[</span><span style="color: #204a87;">len</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span><span style="color: #ce5c00; font-weight: bold;">/</span><span style="color: #0000cf; font-weight: bold;">2</span><span style="color: black; font-weight: bold;">]</span>
<span style="color: black;">less</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;"><</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">equal</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">==</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">larger</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">></span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">less</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">equal</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">larger</span><span style="color: black; font-weight: bold;">)</span>
</pre>
</div>
<br />
When choosing the value in the middle of the array as pivot, yields the following result, with again the red bar indicating the pivot:<br />
<br />
<div style="height: 147px; overflow: hidden; width: 600px;">
<iframe height="165" scrolling="no" src="https://pyalgoviz.appspot.com/show?visualize=true&name=Sorting%20-%20QuickSort%20-%202%20-%20Middle" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
<br />
<div>
<br /></div>
</div>
<br />
Rather than pick the middle element as pivot, a random value can be picked from the array:<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">if</span> <span style="color: #204a87;">len</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;"><</span> <span style="color: #0000cf; font-weight: bold;">2</span><span style="color: black; font-weight: bold;">:</span> <span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">array</span>
<span style="color: black;">pivot</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">[</span><span style="color: black;">random</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">randint</span><span style="color: black; font-weight: bold;">(</span><span style="color: #0000cf; font-weight: bold;">0</span><span style="color: black; font-weight: bold;">,</span> <span style="color: #204a87;">len</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span><span style="color: #ce5c00; font-weight: bold;">-</span><span style="color: #0000cf; font-weight: bold;">1</span><span style="color: black; font-weight: bold;">)]</span>
<span style="color: black;">less</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;"><</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">equal</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">==</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">larger</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">></span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">less</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">equal</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">larger</span><span style="color: black; font-weight: bold;">)</span>
</pre>
</div>
<br />
The pivot is now picked randomly as shown in the visualization below:<br />
<br />
<div>
<div style="height: 147px; overflow: hidden; width: 600px;">
<iframe height="165" scrolling="no" src="https://pyalgoviz.appspot.com/show?visualize=true&name=Sorting%20-%20QuickSort%20-%203%20-%20Random" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<br />
The final approach to pivot selection we will look at was invented by <a href="https://en.wikipedia.org/wiki/Robert_Sedgewick_(computer_scientist)">Robert Sedgewick</a> and involves picking the median of three values:<br />
<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">if</span> <span style="color: #204a87;">len</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;"><</span> <span style="color: #0000cf; font-weight: bold;">2</span><span style="color: black; font-weight: bold;">:</span> <span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">array</span>
<span style="color: black;">pivot</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">[</span><span style="color: #0000cf; font-weight: bold;">0</span><span style="color: black; font-weight: bold;">]</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">[</span><span style="color: #204a87;">len</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span><span style="color: #ce5c00; font-weight: bold;">/</span><span style="color: #0000cf; font-weight: bold;">2</span><span style="color: black; font-weight: bold;">]</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">[</span><span style="color: #ce5c00; font-weight: bold;">-</span><span style="color: #0000cf; font-weight: bold;">1</span><span style="color: black; font-weight: bold;">])</span><span style="color: #ce5c00; font-weight: bold;">/</span><span style="color: #0000cf; font-weight: bold;">3</span>
<span style="color: black;">less</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;"><</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">equal</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">==</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">larger</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">></span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">less</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">equal</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">larger</span><span style="color: black; font-weight: bold;">)</span>
</pre>
</div>
<br />
The pivot is now picked by looking at three different values as shown in the visualization below:<br />
<br />
<div>
<div style="height: 147px; overflow: hidden; width: 600px;">
<iframe height="165" scrolling="no" src="https://pyalgoviz.appspot.com/show?visualize=true&name=Sorting%20-%20QuickSort%20-%204%20-%20Median" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<br />
Aside from how we pick the pivot, all versions of QuickSort shown above have the same thing in common that they use recursion for implementing the divide and conquer tactic. Recursion has overhead in that new stackframes need to be constructed for each call. Recursion also limits the maximum size of the array we can sort, but that is more of a theoretical challenge.<br />
<br />
QuickSort can also be implement without using recursion, by using a user-defined stack to keep track of the work to be done. A possible implementation:<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">qsort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: black;">work</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black; font-weight: bold;">[</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">]</span>
<span style="color: black;">result</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black; font-weight: bold;">[]</span>
<span style="color: #204a87; font-weight: bold;">while</span> <span style="color: black;">work</span><span style="color: black; font-weight: bold;">:</span>
<span style="color: black;">array</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black;">work</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">pop</span><span style="color: black; font-weight: bold;">(</span><span style="color: #0000cf; font-weight: bold;">0</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">if</span> <span style="color: #204a87;">len</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;"><</span> <span style="color: #0000cf; font-weight: bold;">2</span><span style="color: black; font-weight: bold;">:</span>
<span style="color: black;">result</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">extend</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: #204a87; font-weight: bold;">else</span><span style="color: black; font-weight: bold;">:</span>
<span style="color: black;">pivot</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">[</span><span style="color: #0000cf; font-weight: bold;">0</span><span style="color: black; font-weight: bold;">]</span>
<span style="color: black;">work</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black; font-weight: bold;">[</span>
<span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;"><</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">),</span>
<span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">==</span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">),</span>
<span style="color: #204a87;">filter</span><span style="color: black; font-weight: bold;">(</span><span style="color: #204a87; font-weight: bold;">lambda</span> <span style="color: black;">n</span><span style="color: black; font-weight: bold;">:</span> <span style="color: black;">n</span><span style="color: #ce5c00; font-weight: bold;">></span><span style="color: black;">pivot</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">),</span>
<span style="color: black; font-weight: bold;">]</span> <span style="color: #ce5c00; font-weight: bold;">+</span> <span style="color: black;">work</span>
<span style="color: #204a87; font-weight: bold;">return</span> <span style="color: black;">result</span>
</pre>
</div>
<br />
And the execution (this time using the first element as pivot again):<br />
<br />
<div style="height: 147px; overflow: hidden; width: 600px;">
<iframe height="165" scrolling="no" src="https://pyalgoviz.appspot.com/show?visualize=true&name=Sorting%20-%20QuickSort%20-%205%20-%20No%20Recursion" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<br />
In general, QuickSort is not a stable sort. Elements of equal value are not guaranteed to remain in their original sort order for most implementations. However, the implementations above keep the elements that are equal to the pivot in the original order as they were found. That makes the above implementation stable.<br />
<br />
<br />
Check out more algorithm visualizations at <a href="http://pyalgoviz.appspot.com/">PyAlgoViz</a>.<br />
<br />
<br />
<br />
<hr />
Python code styled as <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">tango</span> by <a href="http://hilite.me/">hilite.me</a> with <span style="background-color: white;"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">border:1px solid #ddd;padding:11px 7px;</span></span>
</div>
</div>
Chris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.com0tag:blogger.com,1999:blog-7152855345155101431.post-67799070711054479682016-12-16T14:27:00.000-08:002016-12-18T10:40:34.188-08:00MergeSort<a href="https://en.wikipedia.org/wiki/Merge_sort" target="_blank">MergeSort</a> is a <a href="https://en.wikipedia.org/wiki/Divide_and_conquer_algorithms" target="_blank">divide-and-conquer</a> algorithm that recursively sorts two halves of an array and merges them to form a fully sorted end result. In other words, to sort a sequence, we find the middle point, sort the first half, sort the second half, and finally merge both halves:<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">mergeSort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black; font-weight: bold;">array</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">start</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">end</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">if</span> <span style="color: black;">end </span><span style="color: #ce5c00; font-weight: bold;">- </span><span style="color: black;">start</span> <span style="color: #ce5c00; font-weight: bold;">></span> <span style="color: #0000cf; font-weight: bold;">1</span><span style="color: black; font-weight: bold;">:</span>
<span style="color: black;">middle</span> <span style="color: #ce5c00; font-weight: bold;">=</span> <span style="color: black; font-weight: bold;">(</span><span style="color: black;">start </span><span style="color: #ce5c00; font-weight: bold;">+ </span><span style="color: black;">end</span><span style="color: black; font-weight: bold;">)</span> <span style="color: #ce5c00; font-weight: bold;">/</span> <span style="color: #0000cf; font-weight: bold;">2</span>
<span style="color: black;">mergeSort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">start</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">middle</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">mergeSort</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">middle</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">end</span><span style="color: black; font-weight: bold;">)</span>
<span style="color: black;">merge</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">start</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">middle</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">middle</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">end</span><span style="color: black; font-weight: bold;">)</span>
</pre>
</div>
<br />
The merging of the two halves is done by inserting elements of the second half into the first half. The second half keeps shrinking in size until we are done:<br />
<!-- HTML generated using hilite.me --><br />
<div style="background: #f8f8f8; border: 1px solid #ddd; overflow: auto; padding: 11px 7px; width: auto;">
<pre style="line-height: 125%; margin: 0;"><span style="color: #204a87; font-weight: bold;">def</span> <span style="color: black;">merge</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">array</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">left</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">leftEnd</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">right</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">rightEnd</span><span style="color: black; font-weight: bold;">):</span>
<span style="color: #204a87; font-weight: bold;">while</span> <span style="color: black;">left</span><span style="color: #ce5c00; font-weight: bold;"><</span><span style="color: black;">leftEnd</span> <span style="color: #204a87; font-weight: bold;">and</span> <span style="color: black;">right</span><span style="color: #ce5c00; font-weight: bold;"><</span><span style="color: black;">rightEnd</span><span style="color: black; font-weight: bold;">:</span>
<span style="color: #204a87; font-weight: bold;">if</span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">[</span><span style="color: black;">left</span><span style="color: black; font-weight: bold;">]</span> <span style="color: #ce5c00; font-weight: bold;">></span> <span style="color: black;">array</span><span style="color: black; font-weight: bold;">[</span><span style="color: black;">right</span><span style="color: black; font-weight: bold;">]:</span>
<span style="color: black;">array</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">insert</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">left</span><span style="color: black; font-weight: bold;">,</span> <span style="color: black;">array</span><span style="color: #ce5c00; font-weight: bold;">.</span><span style="color: black;">pop</span><span style="color: black; font-weight: bold;">(</span><span style="color: black;">right</span><span style="color: black; font-weight: bold;">))</span>
<span style="color: black;">right</span> <span style="color: #ce5c00; font-weight: bold;">+=</span> <span style="color: #0000cf; font-weight: bold;">1</span>
<span style="color: black;">leftEnd</span> <span style="color: #ce5c00; font-weight: bold;">+=</span> <span style="color: #0000cf; font-weight: bold;">1</span>
<span style="color: #204a87; font-weight: bold;">else</span><span style="color: black; font-weight: bold;">:</span>
<span style="color: black;">left</span> <span style="color: #ce5c00; font-weight: bold;">+=</span> <span style="color: #0000cf; font-weight: bold;">1</span>
</pre>
</div>
<br />
Executing the above implementation on 40 random numbers produces the following result:<br />
<br />
<div style="height: 247px; overflow: hidden; width: 600px;">
<iframe height="265" scrolling="no" src="https://pyalgoviz.appspot.com/show?visualize=true&name=Sorting%20-%20MergeSort" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
<br />
<div>
<br /></div>
</div>
<br />
This implementation is a stable sort. Elements of equal value remain in their original sort order.<br />
<br />
<br />
Check out more algorithm visualizations at <a href="http://pyalgoviz.appspot.com/">PyAlgoViz</a>.<br />
<div>
<br /></div>
<br />
<hr />
Python code styled as <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">tango</span> by <a href="http://hilite.me/">hilite.me</a> with <span style="background-color: white;"><span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">border:1px solid #ddd;padding:11px 7px;</span></span>Chris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.com0tag:blogger.com,1999:blog-7152855345155101431.post-40528960383847669162016-12-15T15:08:00.000-08:002016-12-15T18:10:37.204-08:00Generating Prime Numbers<div>
In this blog post, we will generate prime numbers and explore brute force, memoization, and dynamic programming.<br />
<br />
Prime number are numbers greater than one that are only divisible by themselves and one. In other words, the following Python function returns True if n is greater than one and none of the numbers in between 2 and n is a factor of n:<br />
<br /></div>
<div style="text-align: left;">
<a href="https://pyalgoviz.appspot.com/show?name=Numbers%20-%20Prime%20Generator%20-%201%20-%20Exhaustive" target="_blank"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj1vlBrTtyezV-7FPuTF33jm2i9VjaEahzlvEZ_kwy0EA6kVI2t7T5AJzwxruVO3-0xmq1AHVe0Z60AWYtE_NQjM3sXLWhKs29vzMiVwwwR0W7ADgqNfxdxrT72JDHENV9mjvSCsHCWQ0I/s1600/Screen+Shot+2016-12-14+at+9.41.21+PM.png" /></a></div>
<div>
<br />
Of course, this implementation is rather naive and takes a lot longer than is necessary:<br />
<br /></div>
<div style="height: 427px; overflow: hidden; width: 600px;">
<iframe height="495" src="https://pyalgoviz.appspot.com/show?visualize=true&name=Numbers%20-%20Prime%20Generator%20-%201%20-%20Exhaustive" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<br />
Our first optimization would be to realize that we need not check <i>all</i> factors in between 2 and n. Namely, we only need to look at the factors between 2 and n/2, as any number higher than that could never be a factor of n.<br />
<br />
Our code now looks like this:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXteQBUbMwmVjbskHJh5_Ah4sIidmkKMSkcn7Y-RNFDkpntbTE8VwnVFue5MLQhww3Z3g6h5d8NdZOQyhO8c3-s5msusMHdPrk6faGnvsM2b5A-DeWXYds7zR3pF5PSb2d0kU6zcZIAE8/s1600/Screen+Shot+2016-12-14+at+9.56.26+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXteQBUbMwmVjbskHJh5_Ah4sIidmkKMSkcn7Y-RNFDkpntbTE8VwnVFue5MLQhww3Z3g6h5d8NdZOQyhO8c3-s5msusMHdPrk6faGnvsM2b5A-DeWXYds7zR3pF5PSb2d0kU6zcZIAE8/s1600/Screen+Shot+2016-12-14+at+9.56.26+PM.png" /></a></div>
<br />
<br />
<== doing half the work<br />
<br />
<br />
<br />
<br />
<br />
<br />
With this optimization in place, the algorithm runs about twice as fast already:<br />
<br />
<div style="height: 427px; overflow: hidden; width: 600px;">
<iframe height="495" src="https://pyalgoviz.appspot.com/show?delay=2&visualize=true&name=Numbers%20-%20Prime%20Generator%20-%202%20-%20Halfway" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<div>
<br />
Still, we can do better. We can actually already stop considering factors when we reach sqrt(n):<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5eVZcu4bL5BRLHeDAYv7dboysDItT5d_aeW1-TulJHE2jCkvsiPPaHX3E2BwJdTPBK1l68LycGelmBZEL9yJ3ISNTzC7j2CJfaZ_clCNj6MP-XT64fpW2pWS-xzuVGHmrGaOXVu8RKMs/s1600/Screen+Shot+2016-12-14+at+11.23.51+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh5eVZcu4bL5BRLHeDAYv7dboysDItT5d_aeW1-TulJHE2jCkvsiPPaHX3E2BwJdTPBK1l68LycGelmBZEL9yJ3ISNTzC7j2CJfaZ_clCNj6MP-XT64fpW2pWS-xzuVGHmrGaOXVu8RKMs/s1600/Screen+Shot+2016-12-14+at+11.23.51+PM.png" /></a></div>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
A further optimization would be to realize that we only need to consider the factors 2, 3, 5, 7, 9, etc. as 4,6,8,etc. could not be factors as they are already divisible by 2. But, we'll leave that refinement to the reader.<br />
<br />
Let's try our sqrt(n) optimization in the next iteration of our algorithm:<br />
<br /></div>
<div style="height: 427px; overflow: hidden; width: 600px;">
<iframe height="495" src="https://pyalgoviz.appspot.com/show?delay=4&visualize=true&name=Numbers%20-%20Prime%20Generator%20-%203%20-%20Sqrt
" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<br />
By now, our algorithm should be more than twice as fast.<br />
<br />
Of course, we can still do better. For instance, say we are checking to see if 41 is a prime number. To validate 41, our algorithm will try the factors between 2 and sqrt(41), which is 6:<br />
<ul>
<li>41 % 2 = 1, not a factor</li>
<li>41 % 3 = 2, not a factor</li>
<li>41 % 4 = 1, not a factor</li>
<li>41 % 5 = 1, not a factor</li>
<li>41 % 6 = 5, not a factor</li>
</ul>
<div>
When we check for factors 4 and 6, we are doing double work. Namely, if 41 would be divisible by 4, it would already be divisible by 2. The same argument goes for 6, which is already divisible by 2 and 3 itself. </div>
<div>
<br /></div>
<div>
In other words, to check if a given number is prime, we need to only verify the prime numbers between 2 and sqrt(n) as divisible factors. This greatly reduces the search space. All we need to do is remember the current list of prime numbers and use those as factors. This approach is generally referred to as a <a href="https://en.wikipedia.org/wiki/Generating_primes#Prime_sieves" target="_blank">Prime Sieve</a> as opposed to the <a href="https://en.wikipedia.org/wiki/Trial_division" target="_blank">trial division</a> approach we used so far:</div>
<div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYgv3-3daPmNL_RJ_cs7HAlDEcTj_rKPRLsNrvhyphenhyphenWuU3rMo8tKqFHe7FZWUqbkvRTlijc9K3YClfrHQb86wyMUOz90k8TyhU5FsdQ4l1k4XKRYEZdvdUD3jKlyKqOa_v9qku3zVQkGz9I/s1600/Screen+Shot+2016-12-15+at+4.05.19+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhYgv3-3daPmNL_RJ_cs7HAlDEcTj_rKPRLsNrvhyphenhyphenWuU3rMo8tKqFHe7FZWUqbkvRTlijc9K3YClfrHQb86wyMUOz90k8TyhU5FsdQ4l1k4XKRYEZdvdUD3jKlyKqOa_v9qku3zVQkGz9I/s1600/Screen+Shot+2016-12-15+at+4.05.19+PM.png" /></a></div>
<br /></div>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
Note that the first line of our isPrime function now has a linear search because we used a list to hold the current primes. Ideally we would have used an OrderedSet instead of a list, to make membership checks O(1) and speed up that part of the algorithm even more. For simplicity, we use a list now.<br />
<br />
Using our newly minted insight on using primes as factors, we produce the following execution:<br />
<br />
<div style="height: 427px; overflow: hidden; width: 600px;">
<iframe height="495" src="https://pyalgoviz.appspot.com/show?delay=5&visualize=true&name=Numbers%20-%20Prime%20Generator%20-%204%20-%20Primes" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<br />
That is confusing. This is slower. How can that be?<br />
<br />
One reason is that at this point, our code is a lot more complex than before. Namely, in order to answer the question if a given number is a prime, in the worst case, we need to discover all the primes before that number. There is also some bookkeeping overhead to caching each of the primes we found so far. We don't get much apparent speedup due to all the extra work we need to do.<br />
<br />
However, once we actually warmed up the cache, by running our loop once ahead of time, and then measuring a second iteration of our loop, we see an interesting speedup again:<br />
<br />
<div style="height: 427px; overflow: hidden; width: 600px;">
<iframe height="495" src="https://pyalgoviz.appspot.com/show?delay=6&visualize=true&name=Numbers%20-%20Prime%20Generator%20-%205%20-%20Primes-Warm" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<br />
Essentially, we are demonstrating the power of memoization here. The progression we have seen so far is from brute force, to a bit smarter, to using another smart insight, to remembering intermediate steps. By remembering the intermediate results, we essentially are performing a space-time tradeoff. By adding a bit more memory, we can avoid performing repetitive operations.<br />
<br />
The next big leap in an algorithm such as finding prime numbers is whether we can make any assumptions on exactly in what order the API is accessed. For instance, are we more interested in being able to figure out if any given random number is a prime number, or are we more interested in producing a prime number generator, one that produces increasingly larger primes? If the latter is the case, we enter the domain of Dynamic Programming.<br />
<br />
By assuming we produce increasingly larger primes, the Pythonic instrument we love to go for is a generator function. It can keep local state, return one result at a time, and resume execution at the place where it yielded the most recent result:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5Dc5255wO5hUrEmZ2dLdYsU4pf7a2bqTWqwVqO8JaZiR9v2D8oNTejIOI0vXgBnnNXgN0hZEKPFRrvc9b8Lkurn0dln8r32PqSNrm97YplITuTtY8COoip8NHHseWNa3iA7CWbyUmooU/s1600/Screen+Shot+2016-12-15+at+4.51.54+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi5Dc5255wO5hUrEmZ2dLdYsU4pf7a2bqTWqwVqO8JaZiR9v2D8oNTejIOI0vXgBnnNXgN0hZEKPFRrvc9b8Lkurn0dln8r32PqSNrm97YplITuTtY8COoip8NHHseWNa3iA7CWbyUmooU/s1600/Screen+Shot+2016-12-15+at+4.51.54+PM.png" /></a></div>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
The use of co-routines in this manner allows us to combine the concept of memoization and cleanly wrap it in a nice abstraction. The function can be used like any regular Python function as follows:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFmJ3wuAMT10TVVz1x7j7GPZRbMpP8wfSyPH8YOrxyBbSVAdq6rv_4fij-I2E7Y3Z16yvsNPIb3GOQ7p4L2CMlLc75u5ePB35GNcrLy72DbLY3fpIJn_Lk3vGKTkjiIuTjn2sUNmDiVkg/s1600/Screen+Shot+2016-12-15+at+5.02.31+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFmJ3wuAMT10TVVz1x7j7GPZRbMpP8wfSyPH8YOrxyBbSVAdq6rv_4fij-I2E7Y3Z16yvsNPIb3GOQ7p4L2CMlLc75u5ePB35GNcrLy72DbLY3fpIJn_Lk3vGKTkjiIuTjn2sUNmDiVkg/s1600/Screen+Shot+2016-12-15+at+5.02.31+PM.png" /></a></div>
<br />
<br />
<br />
<br />
<br />
The final iteration of our algorithm now produces:<br />
<br />
<div style="height: 427px; overflow: hidden; width: 600px;">
<iframe height="495" scrolling="false" src="https://pyalgoviz.appspot.com/show?delay=7&visualize=true&name=Numbers%20-%20Prime%20Generator%20-%206%20-%20DP" style="margin: -12px 0 0 -5px;" width="1187">
Could not render iframe, please disable ad-blockers.</iframe>
</div>
<br />
This demonstrates a nice progression from brute force to memoization to DP.<br />
<br />
That said, if you made it this far, you probably want to take a look at <a href="https://en.wikipedia.org/wiki/Sieve_of_Atkin" target="_blank">Atkin's Sieve</a>, that generates primes using an O(n) approach.<br />
<br />
For more algorithms and their visualizations, check out <a href="http://pyalgoviz.appspot.com/" target="_blank">PyAlgoViz</a>.<br />
<br />Chris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.com0tag:blogger.com,1999:blog-7152855345155101431.post-13564865460773815102016-12-13T18:37:00.001-08:002016-12-13T18:37:39.098-08:00What I Learned While Fighting Fake News<span style="font-size: large;">In a couple of upcoming blog entries I will share what I learned while building <a href="http://realnews-fakenews.appspot.com/" target="_blank">Realnews/Fakenews</a>, a Chrome extension that adds visual indicators to sites such as Facebook and Twitter to warn readers of possible Fake News links:</span><br />
<span style="font-size: large;"><br /></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://www.npr.org/tags/502124007/fake-news" style="margin-left: auto; margin-right: auto;" target="_blank"><span style="font-size: large;"><img border="0" height="250" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjrzgX_Dc4GBvOXrhamk6DBeec0FV70CK-Z0LLG5UpR9d0e3x3yUUbGceUjJpM4_JjnCEcUhYtHO-FWh1mXdvQdDbIZUTmhyphenhyphen9CxcvimDQgSsyCgiP1tHeivGDtp5wW9nCX236SK36-on1s/s400/fakenews.webp" width="400" /></span></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;"><span style="font-size: large;"><a href="http://www.npr.org/tags/502124007/fake-news" target="_blank">npr.org: Fake News</a></span></td></tr>
</tbody></table>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">I will talk about the extension itself, the architecture and implementation of the server, choices in hosting plans and pricing for the server, and how to use machine learning to predict fake news from just looking at the URL. </span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Fake News is serious. It can lead to <a href="http://www.npr.org/sections/thetwo-way/2016/12/13/505424283/pizzagate-suspect-faces-federal-charge" target="_blank">vigilantes storming pizza parlors</a> and <a href="https://www.washingtonpost.com/business/economy/russian-propaganda-effort-helped-spread-fake-news-during-election-experts-say/2016/11/24/793903b6-8a40-4ca9-b712-716af66098fe_story.html?utm_term=.2a09dd3ef9ce" target="_blank">foreign parties influencing US elections</a> or <a href="https://www.theguardian.com/media/2016/dec/13/fake-news-could-affect-next-uk-election-warns-channel-4-executive" target="_blank">upcoming UK elections</a>. It is hard to root out fake news as it feeds on a powerful psychological trait called <a href="https://en.wikipedia.org/wiki/Confirmation_bias" target="_blank">confirmation bias</a>. What that means is that fake news is often designed to cater to a specific target audience. Fake news authors write specific content their targets like to hear, whether it is factual, hyperbolic, or not. Goal is to increase the likeliness of them sharing, thereby strengthening the original cause, or more likely, generating advertisement income.</span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">What can we do about fake news? In a simple diagram, this is how fake news spreads:</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi47EGM1hZemzbeL_lveMCxwYFROF6qPItAZTMgMyRjHctozct32wVFBhJQ7412-JzMaomcOeaig7H6H2eS3YxEUXk-K00fDEQvpbN9WYMKE1UNNGaUt52U8iS82I-4x0T1b3sB2rTpwyw/s1600/FakeNews+%25281%2529.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" height="135" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi47EGM1hZemzbeL_lveMCxwYFROF6qPItAZTMgMyRjHctozct32wVFBhJQ7412-JzMaomcOeaig7H6H2eS3YxEUXk-K00fDEQvpbN9WYMKE1UNNGaUt52U8iS82I-4x0T1b3sB2rTpwyw/s400/FakeNews+%25281%2529.png" width="400" /></span></a></div>
<div class="separator" style="clear: both; text-align: center;">
<span style="font-size: large;"><br /></span></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-size: large;">There is not much we can do to stop fake news publishers from actually creating content. What the industry <i>can</i> do, social media outlets in particular, is to <a href="http://www.npr.org/sections/alltechconsidered/2016/11/15/502111390/facebook-google-take-steps-to-confront-fake-news" target="_blank">stop giving fake news publishers</a> a free distribution channel. That's a good step in the right direction.</span></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-size: large;"><br /></span></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-size: large;">Where the Realnews/Fakenews Chrome extension helps is at the reading stage. When a link is presented on Facebook or Twitter to a third party site as a shared post, a small button is added to the top right to indicate the category of the site:</span></div>
<div class="separator" style="clear: both; text-align: left;">
<span style="font-size: large;"><br /></span></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQOI9AE4UDrhvtmz9u0wmHcVhZmvGJkG6-Tho7fSFl7E_UEE-O7JOKnSqFxD6Jl2nQQcUVycHpgpC-PisQx8qJUyT7o457oMCjLN4J5wQti1x3na3-6L1VjloNV890t4o2ffjAs5BGFQE/s1600/Screen+Shot+2016-12-13+at+9.02.58+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" height="283" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhQOI9AE4UDrhvtmz9u0wmHcVhZmvGJkG6-Tho7fSFl7E_UEE-O7JOKnSqFxD6Jl2nQQcUVycHpgpC-PisQx8qJUyT7o457oMCjLN4J5wQti1x3na3-6L1VjloNV890t4o2ffjAs5BGFQE/s400/Screen+Shot+2016-12-13+at+9.02.58+PM.png" width="400" /></span></a></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">In this case, the site this tweet links to is marked as real news. Of course, the subject itself is <i>fake</i> news, but that's not the point. When the green button is pressed, a category selector shows up as follows:</span><br />
<span style="font-size: large;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsOWMIz5wanBtzBp3JvS_-y1CEQhyphenhyphen2Km_RdXQCeAYU_RbJPuJ55s1PR-SorbighI5m_nQQMJjoVN4PjY7vQEeZGAVM8w6XGaTh1lzacNUGdXujrYnXB97DBcfD3XPCk5G7gNjBCzUrb6c/s1600/Fake+News+2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><span style="font-size: large;"><img border="0" height="220" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsOWMIz5wanBtzBp3JvS_-y1CEQhyphenhyphen2Km_RdXQCeAYU_RbJPuJ55s1PR-SorbighI5m_nQQMJjoVN4PjY7vQEeZGAVM8w6XGaTh1lzacNUGdXujrYnXB97DBcfD3XPCk5G7gNjBCzUrb6c/s320/Fake+News+2.png" width="320" /></span></a></div>
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">When a different category is chosen by the reader, it counts as one vote. A consensus model is used on the server so that future readers will be presented with the popular vote. Voting is non-moderated. </span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Once enough pages have been voted on, features can be extracted such as the hosting domain of the link, words used in the URL, or the occurrence of specific content, such as lots of ads. Machine learning is then used to classify unknown sites and seed the prediction with a meaningful starting point. </span><br />
<span style="font-size: large;"><br /></span>
<span style="font-size: large;">Like I said at the start, more details on the implementation will come in future blog entries. For now, <a href="http://realnews-fakenews.appspot.com/" target="_blank">try it out</a>!</span><br />
<br />Chris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.com0tag:blogger.com,1999:blog-7152855345155101431.post-69526002457907825792016-12-05T13:54:00.002-08:002016-12-05T16:17:15.479-08:00PyAlgoViz on Github<h2>
PyAlgoViz on Github</h2>
<div>
Long overdue, I finally published my <a href="http://pyalgoviz.appspot.com/">PyAlgoViz</a> source code on <a href="https://github.com/laffra/pyalgoviz">github</a>. The Python source shows you how to create a sandbox on AppEngine to run Python code, step-wise debug the code, run visualization logic at each step, record the visualization, and return the result back to the browser to visualize it there using HTML5 Canvas. </div>
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: left;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgncNwJm64I9u_ZbMDSVp-DDdSEzki-cyL5TCc_3_FkzKhnvRt33Bt01Jpc2HmxBEHejEP3UqS7a3UaUKHF9OktBjgk8CQOyDR-j_8TF3n3MG-buaOjZMcaGmT-7g_ESD2xjYgFnLBoZHg/s1600/Screen+Shot+2016-12-05+at+4.52.09+PM.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" height="263" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgncNwJm64I9u_ZbMDSVp-DDdSEzki-cyL5TCc_3_FkzKhnvRt33Bt01Jpc2HmxBEHejEP3UqS7a3UaUKHF9OktBjgk8CQOyDR-j_8TF3n3MG-buaOjZMcaGmT-7g_ESD2xjYgFnLBoZHg/s400/Screen+Shot+2016-12-05+at+4.52.09+PM.png" width="400" /></a></div>
<div>
<br /></div>
<div>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
You may also like the accompanying video and slide deck linked from <a href="http://chrislaffra.com/">chrislaffra.com</a>.</div>
<div>
<br /></div>
<div>
Happy visualizing!</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<br /></div>
Chris Laffrahttp://www.blogger.com/profile/17244232814118596946noreply@blogger.com0