Thursday, December 15, 2016

Generating Prime Numbers

In this blog post, we will generate prime numbers and explore brute force, memoization, and dynamic programming.

Prime number are numbers greater than one that are only divisible by themselves and one. In other words, the following Python function returns True if n is greater than one and none of the numbers in between 2 and n is a factor of n:

Of course, this implementation is rather naive and takes a lot longer than is necessary:

Our first optimization would be to realize that we need not check all factors in between 2 and n. Namely, we only need to look at the factors between 2 and n/2, as any number higher than that could never be a factor of n.

Our code now looks like this:

<== doing half the work

With this optimization in place, the algorithm runs about twice as fast already:

Still, we can do better. We can actually already stop considering factors when we reach sqrt(n):

A further optimization would be to realize that we only need to consider the factors 2, 3, 5, 7, 9, etc. as 4,6,8,etc. could not be factors as they are already divisible by 2. But, we'll leave that refinement to the reader.

Let's try our sqrt(n) optimization in the next iteration of our algorithm:

By now, our algorithm should be more than twice as fast.

Of course, we can still do better. For instance, say we are checking to see if 41 is a prime number. To validate 41, our algorithm will try the factors between 2 and sqrt(41), which is 6:
  • 41 % 2 = 1, not a factor
  • 41 % 3 = 2, not a factor
  • 41 % 4 = 1, not a factor
  • 41 % 5 = 1, not a factor
  • 41 % 6 = 5, not a factor
When we check for factors 4 and 6, we are doing double work. Namely, if 41 would be divisible by 4, it would already be divisible by 2. The same argument goes for 6, which is already divisible by 2 and 3 itself. 

In other words, to check if a given number is prime, we need to only verify the prime numbers between 2 and sqrt(n) as divisible factors. This greatly reduces the search space. All we need to do is remember the current list of prime numbers and use those as factors. This approach is generally referred to as a Prime Sieve as opposed to the trial division approach we used so far:

Note that the first line of our isPrime function now has a linear search because we used a list to hold the current primes. Ideally we would have used an OrderedSet instead of a list, to make membership checks O(1) and speed up that part of the algorithm even more. For simplicity, we use a list now.

Using our newly minted insight on using primes as factors, we produce the following execution:

That is confusing. This is slower. How can that be?

One reason is that at this point, our code is a lot more complex than before. Namely, in order to answer the question if a given number is a prime, in the worst case, we need to discover all the primes before that number. There is also some bookkeeping overhead to caching each of the primes we found so far. We don't get much apparent speedup due to all the extra work we need to do.

However, once we actually warmed up the cache, by running our loop once ahead of time, and then measuring a second iteration of our loop, we see an interesting speedup again:

Essentially, we are demonstrating the power of memoization here. The progression we have seen so far is from brute force, to a bit smarter, to using another smart insight, to remembering intermediate steps. By remembering the intermediate results, we essentially are performing a space-time tradeoff. By adding a bit more memory, we can avoid performing repetitive operations.

The next big leap in an algorithm such as finding prime numbers is whether we can make any assumptions on exactly in what order the API is accessed. For instance, are we more interested in being able to figure out if any given random number is a prime number, or are we more interested in producing a prime number generator, one that produces increasingly larger primes? If the latter is the case, we enter the domain of Dynamic Programming.

By assuming we produce increasingly larger primes, the Pythonic instrument we love to go for is a generator function. It can keep local state, return one result at a time, and resume execution at the place where it yielded the most recent result:

The use of co-routines in this manner allows us to combine the concept of memoization and cleanly wrap it in a nice abstraction. The function can be used like any regular Python function as follows:

The final iteration of our algorithm now produces:

This demonstrates a nice progression from brute force to memoization to DP.

That said, if you made it this far, you probably want to take a look at Atkin's Sieve, that generates primes using an O(n) approach.

For more algorithms and their visualizations, check out PyAlgoViz.

No comments:

Post a Comment