ALGORITHMS 101 – PART 5: RANDOMIZED ALGORITHMS

Having covered some very important topics like recursion and asymptotic notation, we’re going to continue our journey through algorithm basics with a discussion of randomized algorithms. By definition, the behavior of randomized algorithms incorporates randomness as some part of its formal structure. The purpose of this is probabilistic: it seeks to achieve good performance in the “average case” while its actual performance in any one case will, of course, be a random value governed by the specific but randomly selected input used in that case. Let’s consider an example:

The Hiring Problem

This example comes straight out of the chapter 5 of Intro To Algorithms 3rd Edition. Suppose that I want to hire a new worker at my company and I use an employment agency that charges me a fee for every interview I request and an even larger hiring fee if I actually want to hire an interviewee. I’m ok with paying these fees because I want to have the best possible person for the job at any point in time. Therefore, my strategy is to constantly interview applicants and every time I find someone who I think is even slightly better than the current job-holder, I just fire my current employee and replace them with the applicant. By the way, I should add that I’m a horrible boss.

Now, before I implement this cold-hearted strategy, I want to estimate its cost with all those interview and hiring fees and so I create a method called HIRE-WORKER to represent my proposed strategy. In this procedure, I create a variable called “best” which is intended to hold the value of the best candidate I’ve seen so far. It starts at 0 because there’s no candidate-zero; I haven’t yet interviewed anyone and no nobody could possibly be my best candidate yet. Then, I number off each candidate 1 to n for a total of n candidates with the variable i holding the index value for that candidate. Now, for each candidate, I want to interview them and, if they’re better than the best I’ve seen so far, I want to replace the value of “best” with their index value and hire them. Pretty straightforward.  Figure 1 below describes the HIRE-WORKER algorithm. I’ve included some Java code for an implementation example, however, it’s a bit clunky because the HIRE-WORKER algorithm is more analogous to the literal example of hiring people whereas a code implementation requires some extra accommodations which you’ll see in the code comments (for example, i have a function where a user might type in a candidate score representing their impression of a potential hire). Note also, that every time we identify a new best worker, we invoke another method “hireBest” which hires that worker. As is the case with object-oriented programming, that method would be defined elsewhere in the code.

Figure 1

the HIRE-WORKER method

private static void hireWorker {
     int best = 0;
     int bestScore = 0;
     int n = list.length;
     for (i = 1; i < n; i++) {           String number = bufferedreader.readLine(); // This is a convention for reading           int score = Integer.parseInt(number);      // user input like candidate scores           if (score > bestscore) {
               best = i;
               hireBest(i);     // every time there's a new "best" hire 'em!
          }
     }
}

We’ve spoken at length about algorithm cost and, heretofore, that cost has been in running time. In this example, we’re going to be less concerned with running time and more concerned with monetary cost: the total cash amount that I have to spend in order to find the best candidate. So let’s look back at this HIRE-WORKER algorithm. Interviewing has some cost, c-interview, while hiring incurs a larger cost that we’ll call c-hire. If we add that m is a variable representing the total number of people that I have to hire in order to find the best worker, then I can say that the total cost of implementing this algorithm—and, in the process, of finding the best hire—is O(-interview*n + c-hire*m). Recall that this is O-notation, which means that it represents the upper bound when it comes to algorithm cost. So what this tells us is that the worst case cost will be O(-interview*n + c-hire*n) because we would interview and hire every candidate n. Since the interview cost c-interview is much smaller than the hiring cost, is becomes asymptotically negligible and so we can just reduce the worst case to O(c-hire*n).

Here’s where we begin to see the random, or stochastic, element at play. In the worst of all possible cases, we get a list of potential hires n people in length and it just so happens that they are in order of worst-to-best and therefore we end up hiring every candidate we interview (in which case c-hire*m would equal c-hire*n).  But, of course, this won’t always—if ever—be the case. In fact, candidates will come in unpredictable orders and we know that we’ll hire somebody but we don’t know if that magic person will be the first person we see—in which case we will only hire one time—or the last person we see—which would be our worst-case scenario. Notice how this kind of uncertainty makes it really hard to meaningfully describe the running time of this algorithm because, once we say that our hiring cost could be anywhere from c-hire*1 up to c-hire*n, then our O-notation (-interview*n + c-hire*m) becomes a pretty useless and random upper bound doesn’t it? It establishes a huge and unpredictable range of costs for any specific execution event of this algorithm. So what do we do?

Dealing with Randomness

Well, we just can’t possibly know which input we’ll be working with for any specific instance of this algorithm. What we can do though is make some reasonable assumptions about those inputs, which will allow us to generate an average case scenario. With regard to the hiring algorithm, we can make the assumption that in any case we will be presented with candidates in a random order. If this is the case, then we can also say that our ranked list of applicants has just as much likelihood as anything of being one of the permutations of the list of candidates 1…n since each permutation—that is each possible order of candidates 1 through n—has an equal probability.

It is very important to note here that the hiring agency we’re actually working with probably doesn’t send us randomly ordered applicant lists. In such a real-world situation, these lists might appear in the order of application score (which would bias the order toward more qualified applicants) or it might similarly refer applicants in the order of when they submitted their materials to the agency (in which case the early submitters may also be more organized and therefore more qualified, again biasing the list toward a certain type of applicant). Because our algorithm cannot control this we can instead force the input to be randomized by deliberately randomizing it. For instance, we could scramble each applicant list upon entering it into the computer. This intentional randomization, which will often employ a certain code convention (an object) known as a random number generator is a quintessential example of a truly randomized algorithm. One last thing: while this randomness serves us in an abstract way, it will still be the case that each specific instance of the algorithm must operate upon some ordered list—random or not—and so the specific running time in each case will be different. Because of this, we can’t really say that such-and-such is the running time but, rather, the expected running time.

Analysis

In order to compute the expected number of times we’ll be hiring a new assistant, we first have to introduce a convention known as an indicator random variable. This variable represents the likelihood of a specific event occurring. For example, if we’re talking about flipping a coin, then we can assign a random variable to the event that the coin will land on heads. This variable will, of course, be equal to ½. So for the hiring problem, we’ll randomize the candidates, thus allowing us to say that the indicator random variable for each candidate considered will equal the likelihood of hiring that candidate. So, if we have n candidates, we’ll be considering n interview and potential hiring events and we’ll have n variables.

As we consider candidates from 1 up to n we assign each candidate we’re currently considering an index value i. This way, we can say that the candidate we’re currently considering—whichever candidate that is—is candidate i and we’ll compare that candidate to all the candidates we’ve considered so far which would be candidates numbered 1 through i-1. Because we randomized the candidate list, we can say that each candidate i is equally likely to be the best candidate we’ve seen so far and so the random variable in the case of any candidate i is equal to 1/i because each candidate i has 1/i likelihood of being better than all candidates before her (candidates 1 through i-1 that is).

If our random variable is called H-i for hiring some candidate i then the likelihood for total hires across the whole list of candidates n is equal to H-1 + H-2 +…+ H-n which is equal to 1/i for I = 1…n, or, 1 + ½ +…+1/n. Ultimately, this equals ln(n). Finally, if the overall likelihood of hiring candidates over the whole candidate list is ln(n), then the total hiring cost c-hire*n is equal to c-hire*ln(n).  Compared to the previous worst-case of O(c-hire*n) this average case of O(c-hire*ln(n)) is actually more cost-efficient, hence the advantage of algorithm randomization. With this in hand, we can now move on to Figure 2 which shows the new and improved RANDOMIZED-HIRE-WORKER algorithm. As you’ll notice, line 1 of this code calls a method that randomly permutes the list of candidates (“randomize candidate list”). This method is explained with a graphical depiction in figure 2 as well. Things get a bit tricky when it comes to randomization because it involves a concept that we have not covered yet. Therefore, you see some of the code required to implement this map:value sorting but it is incomplete (we’ll cover that topic later on).

Figure 2

A: The new and improved RANDOMIZED-HIRE-WORKER method. B: The "randomize candidate list" method invoked by line 1 of the RANDOMIZED-HIRE-WORKER method

private static void hireWorker {
     randomize(list);
     int best = 0;
     int bestScore = 0;
     int n = list.length;
     for (i = 0; i < n; i++) {           String number = bufferedreader.readLine(); // This is a convention for reading           int score = Integer.parseInt(number);      // user input like candidate scores           if (score > bestscore) {
               best = i;
               hireBest(i);     // every time there's a new "best" hire 'em!
          }
     }
}

 

private static void randomize {
     int n = list.length;
     int p = new int[n];
     for (i = 0; i < n; i++) {
          p[i] = randomGenerator.nextInt(100); //call random number generator
     }

     HashMap map = new HashMap(); //map:value sorting
     ValueComparator bvc =  new ValueComparator(map);
     TreeMap sorted_map = new TreeMap(bvc);
     for (i = 0; i < n; i++) {
          map.put(list.i , p.i);
     }
}

public int compare(Object a, Object b) {
     if((Double)base.get(a) < (Double)base.get(b)) {
          return 1;
     } else if((Double)base.get(a) == (Double)base.get(b)) {
          return 0;
     } else {
          return -1;
     }
}

The above example invokes a permutation method to randomize our candidates prior to any interviews. The exact method of permutation is more of a side-note in the context of this post and this map:value implementation is but one example among many. In any case, the point here is that it accomplishes randomization in order to ensure consistency in the average case (this is the goal of randomized algorithms) even though each specific instance will vary. This concludes our intro to randomized algorithms. Next, we’ll begin Algorithms 102, where we will start discussing algorithm design and function more specific to Bioinformatics.

Tagged ,

ALGORITHMS 101 – PART 4: DIVIDE AND CONQUER

We spoke about merge-sort earlier as an example of a divide-and-conquer algorithm. As such, it is recursive, repetitively splitting an unsorted set (the original input) into smaller subsets and then splitting those subsets into even smaller subsets etc… until the subsets contain only one element and therefore cannot be divided any further. At this point, the recursion “bottoms out.” This overall problem solving workflow goes like this: take the original problem and divide it into smaller sub-problems, conquer the sub-problems and then combine the solutions to those sub-problems into a solution to the original problem. When the sub-problems are large enough that they can still be subdivided, that is called the “recursive case” and the point at which the sub-problems cannot be divided any further is referred to as the “base case.” Looking at the recursion tree shown in Figure 1, each level of the tree is a recursive case except the bottom level, which is the base case.

Figure 1

recursion tree showing recursive cases and the base case where the recursion "bottoms out"

With this in mind, we can proceed to a more advanced divide-and-conquer algorithm: the maximum-subarray problem.

Maximum-Subarray Problem

The maximum-subarray problem can be thought of like this: suppose that I have an array A of numbers where A = (1, 2, -6, -5, 7, -2, 4, 4) and I want to know the subarray from within this set A that has the highest sum. A good analogy can be found in the stock market: if share prices go up and down and if I have a ticker that spits out a number every hour reporting the change in stock price, it might look like the array A, where the price of my share went up 1 dollar, then up 2, then down 6, down 5 etc… If I want to make an algorithm to guide my trading of this stock, I need it to tell me what within that total set A is the sub-set, among all possible sub-sets containing sequential numbers, that has the highest total sum. That subarray therefore represents a contiguous time period within which the highest total increase in stock price is found. Because I’m looking for a subarray from within the array A that has the highest total sum, I can say that I’m looking for the maximum subarray of A and, therefore, I am confronted with the maximum subarray problem.

There are many possible ways to solve the maximum subarray problem, some more efficient than others. For example, we could test every possible subarray starting with the pair (1, 2), then (1, -6) etc… all the way up to (-2, 4) and (4, 4). However, this would be an inefficient, overly laborious solution—often referred to as a “brute force” solution—and would have a theta-notation of n², where n is the number of elements in the set A.

We can do this more efficiently, however, if we implement a divide-and-conquer approach. We could divide the original subarray in half recursively, conquer the task of finding the maximum subarrays of the subproblems and then combine these solutions to find the maximum subarray for the entire set. In order to do this, we need to establish a midpoint for each array so that we can split the array—and its subarrays—in half. This lays the groundwork for recursively dividing the array A. Now, let’s take a step back and consider the starting array and its two halves. From this, we can make the following statement: any contiguous subarray of the overall array A will either A) be encompassed by the first half, B) be encompassed by the second half or C) cross the midpoint, bridging both halves.

This might seem obvious, but it’s what makes divide-and-conquer possible in this case. The maximum subarray of the set A—whatever it is—must have the highest sum over all of the subarrays in the first half, the second half and the midpoint-crossing group. Since each recursive division of each subarray creates two halves, the issue of finding the greatest subarray in these non-midpoint-crossing groups can be recursively solved because each half is itself split, creating a new left half, right half and midpoint-crossing subgroup. In order to accomplish this recursion, we need to define a method for finding the maximum subarray that crosses the midpoint because this method will be called upon each subset as it is split. This works because we know that for each level of the recursion–as shown in Figure 3–we have some subset that we can think of both as the derivative created when its superset was split as well as a superset of its own that will be split in order to generate the next lower level of the recursion. Then, the maximum subarray from this level will be either the maximum subarray from its left half, its right half or a midpoint-crossing group and THAT answer will constitute the new left-max, right-max or max-cross for the solution to the level above it. This continues until we have the end solution for the maximum subarray for the original set A.

The Find-Max-Crossing-Subarray Subroutine

So let’s start by describing an algorithm to find the maximum midpoint-crossing subarray, which we’ll call “find max-crossing subarray” as shown in Figure 2. This protocol begins by finding the maximum subarray of the left half of the array A[n], A[1…n/2]. This subarray must contain the midpoint n/2 and be contiguous with that midpoint because, afterall, we’re concerned here with those subarrays that cross the midpoint so the “for” loop considers only subarrays of the form A[i…n/2] where i is an index position starting at n/2 and moving leftward until i = 1. To reiterate, since we’re trying to cover the portion of the left half containing midpoint-crossing subarrays, they have to contain the midpoint because, if they did not, then they wouldn’t cross the midpoint. Furthermore, we also only care about contiguous subarrays and so we’re only going to look at ones that are contiguous with that midpoint. Therefore, we’ll start with only n/2, then the subarray containing n/2 and its left neighbor and then the subarray containing those two plus the next left neighbor and so forth until we ultimately consider the entire left half, from position 1 all the way to position n/2.

Left-sum holds the value of the maximum subarray of the left side and is initialized to negative infinity. It has to be done this way to ensure that no matter how high or low the subarrays are in the set A, it will at least be possible for their sums to be a new left sum. If we initialized left sum to any real number above negative infinity, then this algorithm would only work over a range of possible maximum subarray sums higher than that initial value. However, we want it to work for all possible subarray sums and so we initialize it to a value so low that any and all sums that the algorithm encounters may overwrite this initialization as the new left sum. The first subarray will, of course, have a value higher than negative infinity and so left-sum will hold the value of the first subarray and that value will be replaced with the value of any greater subarray we may come across as we consider all subarrays within A[1…n/2]. We also have an analogous protocol for the right half of the array A[n], which we’ll call A[(n/2)+1…..n]. Lastly, this protocol takes the indices i and j of the maximum subarrays from the left and right halves, along with the additive value of their respective sums. These elements—the indices and their sum—are delivered as a set, commonly referred to as a “tuple.”

Figure 2

the FIND-MAX-CROSSING-SUBARRAY method

private static long maxcross (ArrayList A, int left, int right) {
     int leftSum = -99999999;
     int sum = 0;
     for (i = mid; i >= left; i--) {
          sum += A[i];
          if (sum > leftSum) {
               leftSum = sum;
               int maxLeft = i;
          }
     }
     int rightSum = -99999999;
     int sum = 0;
     for (i = mid; i  rightSum) {
               rightSum = sum;
               int maxRight = j;
          }
     }
     int maxSum = leftSum + rightSum;
     return array(maxLeft, maxRight, maxSum);
}

This find-max-crossing-subarray procedure contains two “for” loops, each iteration of those loops takes constant time, and there must be n iterations. Therefore, this procedure takes theta n time. Also, it’s worth mentioning that the internet contains many code examples, some better than others. Though slightly modified, I want to acknowledge Laote’s Notes for providing a solid example of this subroutine.

The Complete Find-Max-Subarray Solution

The entire find-maximum-subarray algorithm, shown in Figure 3, first tests to see if the array A is only one element because, in that case, it is already its own maximum subarray. If it contains more than one element, then it determines the midpoint and divides the array A into left and right halves. Then, it recursively finds the maximum subarrays from within the left and right halves recursively. Line 6 in the pseudocode finds the maximum subarray crossing the midpoint through the find-max-crossing-subarray method described above. This part is important because notice that we refer to the find-max-crossing-subarray protocol in this code example just by referring to its title “find-max-crossing-subarray” and the actual code for that specific method is found elsewhere either in a different file or at some other part of the code in this file. This convention, where we describe a method like find-max-crossing-subarray and “call” that method in other parts of code wherever we want and as often as we want simply by using its name is an essential part of object-oriented programming, which we’ll talk more about in later series. Lastly, lines 7-11 of the pseudocode search the left subarray for the subarray with the maximum sum, if it is not there, it they turn to the right subaray and if neither contains the maximum sum, then it searches the midpoint crossing subarray.

Figure 3

the FIND-MAX-SUBARRAY method

private static long maxsub (ArrayList A, int left, int right) {
     if (left == right) {
          return left;
     }
     else {
          int mid = (left + right)/2;
          long a = maxsub (A, left, mid);
          long b = maxcross (A, left, right);
          long c = maxsub (A, mid + 1, right);
          return Math.max(a, Math.max(maxcross(A, left, right), c));
     }

}

Lines 2 and 3 take constant time because they are executed if n is the constant 1. When n is greater than 1, the recursive portion of the algorithm is executed. The find maximum subarray methods performed on the right and left halves of the set A each have a running time of n/2 (they each take linear time over an input of n/2) which divides recursively. With a theta-notation of nlog(n), this can best be represented with a recursion tree as in Figure 3 and should remind you of the recursion tree for merge-sort.

Now, we’ve described a recursive algorithm for finding the maximum subarray for array A. Next, we’ll move on to a new type of algorithm: randomized algorithms, which will help lay the foundation for an understanding of probability and will play a large part in Algorithms 102, where we’ll get into more specific bioinformatics algorithms.

Tagged , , ,

ALGORITHMS 101 – PART 3: MORE ASYMPTOTIC NOTATION

Before we dive deeper into divide-and-conquer algorithms (of which merge-sort is an example) we’ll cover a few remaining topics in asymptotic notation. We talked initially about theta-notation which, as it turns out, has been pretty useful in understanding algorithm efficiency. However, there are several types of asymptotic notation that represent different considerations on the topic of efficiency and we’re going to talk about them right now and apply them briefly to the concept of divide-and-conquer problem solving.

Theta-notation

We’ve used theta-notation to represent the worst-case running time of algorithms thus far but it’s important to take a closer look at how this representation works. We know that, for any algorithm, the actual running time varies depending on both input size and input type and we explicitly referred to this with regard to insertion-sort in that sorting an input of already sorted numbers would take much less time than sorting an input of reverse-sorted numbers. Also, we mentioned that theta-notation’s strength is in the fact that we can abstract away from so many particulars and speak in broad strokes, like speaking only of worst-case scenarios. Because of this, we can think of theta-notation as a way of making a blanket statement about a set of possible efficiencies.

Let’s put this in a mathematical notation called “set notation.” This notation describes a collection of members bound by the symbol “{“ and “}” and makes use of the symbol “:” (meaning “such that”) to describe the qualities of those members. As we mentioned, algorithms are problem solving protocols that operate on inputs n, which is to say that they are really functions f(n). Moreover, since f(n) varies with n, theta-notation for any given algorithm actually refers to a set of functions where f is the same but n is different for every possible input—really a set of f(n)’s. This is much more clearly represented in set notation as shown in Figure 1.

We introduce g(n) here as well, which is also a function operating on some input n. However, recall that, when we represent something in theta-notation, we abstract away from a lot of the mathematical particulars–we drop coefficients and lower-order terms–and so we end up with a function that is similar but not really the same as f(n) if f(n) is the actual way that an algorithm operates on inputs n. We end up with an approximate but slightly different function g(n) which, as we will see, doesn’t graph the exact same as f(n) but is very useful to express boundaries and limits across all possible f(n)’s.

Figure 1

theta-notation for the algorithm g(n) where there is an f(n) for all possible inputs n

What this really says is that there is an input size n0 such that for ALL n at and to the right of that point (that is, for all inputs greater than or equal to n0), f(n) will ALWAYS be between c1g(n) and c2g(n). Notice also that c1 and c2 are constants, and the only difference between c1g(n) and c2g(n) is the constant so really this says that for all inputs at or above n0, we can specify the function f(n) to between the constants c1 and c2. That is, above n0, f(n) is equivalent to g(n) to within a constant factor. This is called “bounding” and we say that g(n) is “asymptotically tightly bound” for f(n).

O-notation

As we just saw, theta-notation “bounds” a function from above and below. O-notation, on the other hand, only bounds a function from above, providing an “asymptotic upper bound.” Figure 2 shows the set notation and graphical expression of O-notation for a function g(n). O-notation, like theta-notation allows us to establish a boundary for a function to within a constant factor, here saying that for any input n greater than or equal to n0, the function f(n) will be at or below g(n). It’s important to note here that theta-notation is a “stronger” notation than O-notation, bounding a function from above and below, and this implies that if f(n) = theta g(n), then f(n) also equals O(g(n)).

Figure 2

O-notation for the algorithm g(n) where there is an f(n) for all possible inputs n

O-notation is very useful because it allows us to quickly ascertain the worst-case running time for algorithms. For example, insertion-sort has a doubly-nested loop, the inner loop is at-worst executed once for each pair of values i and j and both i and j are at most n, so this doubly nested loop is–at worst–n-squared. Therefore, we can immediately see just from the nesting structure of the code that the upper bound for insertion-sort is O(n²). Again, this does not describe performance for every input but it describes what we are really interested in when it comes to algorithm efficiency: the worst-case scenario. So, if nothing else, the take-away from this is that O-notation is worst-case notation, representing the upper bound for an algorithm’s running time: it won’t run any worse than its O-notation.

Omega-notation

This convention is much like O-notation except it describes a lower bound instead of an upper one. Figure 3 depicts the omega-notation of an algorithm g(n) along with its description in set notation. Omega-notation simply states that for all inputs n greater than or equal to n0, f(n) will be above cg(n). Where O-notation describes a worst-case scenario, this notation describes the best-case scenario for an algorithm in terms of running-time.

Figure 3

Omega-notation for the algorithm g(n) where there is an f(n) for all possible inputs n

In English, this says that for any and every input at or above n0, the running-time of this algorithm—its performance f(n) for any input n—will be at least  some constant multiple of g(n) and never any lower. For insertion sort, the best-case would be a previously sorted input which would reduce the running time compared to worst-case and would eliminate the need to call nested loops. However, we would still need to look through every element in the set A and so the best case—the omega-notation—would never be any better than n. In other words, insertion-sort has an omega-notation of Omega(n).

There are two more notational conventions known as o-notation (that’s “little o” as compared to “big O” described above) and w-notation (or “little omega” as opposed to “big omega”). For most things, these notations can be thought of as identical to their big O and big omega corollaries except that the little letters denote bounds that are NOT asymptotically tight. In other words, while big-O notation sets an upper bound such that f(n) is less than or equal to cg(n) for SOME constant c, little-o notation sets an upper bound such that f(n) is less than (but not equal to) cg(n) for ALL constants c. This is a much looser upper bound and in little-o notation, the performance of an algorithm on an input n, f(n), becomes insignificant as g(n) approaches infinity. More formally, the limit of f(n) divided by g(n) as n approaches infinity is simply zero. Little-omega notation has an analogous relationship to big-Omega notation and it can be said that the limit of f(n) divided by g(n) as n approaches infinity is simply infinity.

This should now give us a more solid foundation in asymptotic notation from which we can move forward into the discussion and analysis of other algorithms. We’ve covered a lot of ground in only a little space so a quick tour of Wikipedia and/or Google is always a great thing. Much of this series is based upon the MIT Press’ Introduction to Algorithms, 3rd Edition by Cormen Leiserson, Rivest and Stein, which I would recommend for anyone wanting a more in-depth tour of algorithm analysis but be warned, it’s over 1300 pages in full and easy to get lost within. Now, with this foundation in hand, we can move on to a lengthier discussion of divide-and-conquer algorithms in Part 4.

Tagged , , ,

ALGORITHMS 101 – PART2: INSERTION-SORT AND MERGE-SORT

Having previously discussed the concept of algorithm efficiency, we can now move on to two simple and practical examples: insertion-sort and merge-sort. Once we describe their function and peek under the hood with regard to how they’re implemented in code, we can model a comparison between the two using the theta-notation technique covered in Part 1.

Before we begin, there’s one quick point to make, when we describe the inner mechanics of these algorithms we’ll do it in 3 layers: language specific code (in Java), pseudo-code (which isn’t a real programming language but is laid out like one) and plain old English. This way we can easily transition from a conversational explanation of an algorithm’s function to its actual writing in code. Hey, and we even might pick up a little programming knowledge along the way. So, without further ado, let’s dive into out first example:

Insertion-Sort

The overall purpose of this algorithm is pretty simple: sort a list of numbers. In our example we’ll be a bit more specific and say that we want to sort a list of numbers from lowest to highest and, in keeping with the definition of an algorithm, we need a case-nonspecific method for doing this.

Insertion-sort works a lot like sorting a hand of playing cards in that you start with the second-to-leftmost card and move right one card at a time until you’ve gone through the whole hand. For each card, you pull it out and compare its value to all the neighbors to the left of where you pulled it from, one at a time, until its value is less than or equal to a card in the hand. Then, you simply insert the card at that position and every card to the right moves over one place. This is graphically represented in Figure 1.

Figure 1

the INSERTION-SORT method with Java code example

public static void insertion_sort (ArrayList A, int n) {
     for (int j = 1; j < array.length; j++) {
          int key = array[j];
          int i = j - 1;
          while ((i > 0 && (array[i] > key)) {
               array [i + 1] = array [i];
               i--;
          }
     array [i + 1] = key;
     }
}

Now we can describe this algorithm in more a more formal way by saying that the hand of cards is a set A with up to n elements, or, A[1…n], and that we begin at a position jwithin that set, this position is known as A[j], and that j begins at 2 and ends at the final element of set A. The numerical position of this final element is the length of A, or, A.length. So basically the scope of possible j values goes from 2 to A.length and since we perform our insertion sort for each j, the total number of times we “loop” through that protocol is one less than the total number of elements in set A, which we can express as A.length – 1 (Note: we began at position 2 and we did in fact perform the protocol at position 2, so it’s counted, but we did not do this at position 1).

One further point, the position within set A known as j, A[j] as we’re calling it, is a reference to a certain element (or card). So, if j = 5 and the 5th card is a 6, then A[5] = 6 and 6 is the key referred to by A[5]. Moreover, for every position A[j]. we’re also interested in its left neighbor i, which we refer to by A[i] and i = j – 1 (if A[j] is the 5th element, then its left neighbor A[i] is equal to A[j-1], the 4th element). With these two indexes, i and j, we can move on to a description of this algorithm in pseudocode  and Java(Figure 1).

What we’ve set up here for insertion-sort is three main building blocks for any protocol: initialization, a repeatable part—or maintenance—and termination. The repeatable part is known as a “loop invariant,” so named because it does not change throughout the process, it just gets repeated. The code shown in Figure 1 constitutes a loop invariant because it is value-nonspecific and just gets repeated over and over (or “looped”) until the program encounters the termination condition (we’ll talk more about loop invariants and their important relationship to the correctness of algorithms in later posts). Lastly, our initialization is when j = 2 and our termination condition is when j = A.length, the last element in the set A (to be a bit more specific, the termination condition is actually when j > A.length because it isn’t until the value of j increments beyond the limits of set A that the program knows not to execute the loop invariant and instead exit the subroutine).

Merge-Sort

Similar to insertion-sort, merge-sort is an algorithm for sorting numbers in a set but the mechanics are different. Sticking with the card-sorting analogy, we can visualize merge-sort in the following way: starting with my hand of cards, I split the hand into two small decks. Then, I look at the first card in each deck and place the smaller of the two face down on the table. This first card that I’ve placed on the table is the first member of my sorted set. Now, I just loop through this process, looking at the top card on each of my two decks and placing the smaller of the two face down on the table until my two decks run out. It’s important to realize that this algorithms does not require that I alternate between decks, I can keep pulling from the same deck several times in a row, I just want the smaller of whichever cards are on the top of the two decks. Once I run out of cards in my two decks, I should have a sorted series of cards on the table. This method is represented in Figure 2.

Figure 2

the MERGE-SORT method with Java code example

private void mergesort (int low, int high) {
     if (low < high) {
          int middle = (low + high)/2;
          mergesort (low, middle);
          mergesort (middle + 1, high);
          merge (low, middle, high);
     }
}
private void merge (ArrayList A, int low, int middle, int high) {
     //Merge a[low...mid] with a[mid + 1...high]
     int i = low;
     int j = mid + 1;
     for (int k = low; k  mid) {
               a[k] = aux[j++];
          } else if (j > high) {
               a[k] = aux[i++];
          } else if (less(aux[j], aux[i])) {
               a[k] = aux[j++];
          } else {
               a[k] = aux[i++];
          }
     }
}

Again, to provide a more formal description, I split the set A down the middle into two smaller sets, L and R. If A has n elements, then L.length = n/2 and R.length = n/2. Then, for each set L and R, I begin at the first element, which we’ll call i and j respectively. So, i ranges from 1 up to L.length and j ranges from 1 up to R.length. Now, I have to set an index system for my sorted set A. We’ll call it k, so the k element of A is A[k] and k ranges from 1 up to the total number of elements in subsets L and R combined or, in other words, k goes from 1 up to L.length + R.length. Finally, I can start doing my merge where I compare the i element of L, L[i] to the j element of R, R[j], and if L[i] is less than or equal to R[j], then I set the k element of A to the value of L[i], increment i by one place and increment k by one place (as I move through my deck L and I stack up the sorted pile A). Otherwise, if R[j] is less than L[i], I set the k element of A to R[j]. This process terminates once I run out of cards in decks L and R, meaning that the termination condition is that i = L.length and j = R.length. Figure 2 shows this represented in pseudocode, English and Java.

Analysis

So now it seems appropriate to wrap this up with a little comparative analysis. Algorithm performance is a variable thing depending not only upon the size of input but the kind of input for any given size. Take insertion sort as an example, even for two inputs of the same size, the running-time can vary depending on whether the input is already sorted, intermediately sorted or totally unsorted since, as you can see in Figure 1, inserting and moving cards only occurs if the pulled card belongs somewhere buried in the sorted set.

To make this easier, we will just assume the worst-case scenario for all algorithms and characterize efficiency according to the maximum number of required operations. As such, insertion-sort for a totally unsorted input of size n can be expressed as shown in Figure 3 and simplifies to a theta-notation of an² + bn + c or simply n² and thus running-time scales quadratically with input size n. Now, evaluating merge-sort is a bit trickier since it is what is referred to as a recursive algorithm (it calls upon itself) and so it is best represented as a recursion tree, shown also in Figure 3.

If you’re wondering why merge-sort is best represented as a tree, this is because a pure merge-sort algorithm actually recursively splits the initial set over and over again, resulting in single isolated values. The first split is shown in Figure 2 but you can imagine this splitting process recurring as shown in the tree; all the other operations of merge-sort are the same. A pure merge-sort algorithm would recursively call the splitting function until it “bottoms out,” that is, until the function cannot be called anymore. This would be when the size of the sets is 1. Furthermore, in this pure merge-sort case, insertion-sort is not needed because, when the subsets contain only 1 element, they are therefore already sorted. However, in more realistic situations, it may not be the case that a pure merge-sort algorithm is implemented but, instead, some hybrid of merge- and insertion sort wherein the original unsorted set is split once or a few times up to a point after which insertion sort is used (or, to speak more properly, the insertion sort “method” is “called”) to sort the split sets before merging them again. We didn’t go into that here because it confuses the example and it’s probably best to stay more basic at this point but it’s good to be aware that this type of mixing and matching algorithms is done in practice.

For merge-sort each layer of recursion takes a cn time and the fully expanded tree has logn + 1 levels making the whole operation take cn (logn + 1), or, cnlogn + cn time, which simplifies in theta-notation to just nlogn time. Compared to n² time, this is more efficient overall, as the graph above shows. One important thing to note here is that, in computer science, the default base for logarithms is 2 (not surprising considering that we’re talking about binary) which probably differs from what you and I are accustomed to from high-school or college where the default base was 10. So, in short, unless otherwise specified, “log” = log2 = log base 2.

Figure 3

comparison of insertion-sort and merge-sort

If some of the particulars of the mathematical descriptions of these algorithms—or of the Java subroutine code—don’t quite make sense, don’t worry, we’ll be covering this territory over and over throughout this series and other series. A supplemental Google search might be nice at this point to help clear up some fuzzy areas but at this point it’s completely fine to be a little uncertain about some terminology. That wraps up our discussion of insertion-sort and merge-sort and an introductory comparison of the two. Next, we’ll move on to a more in-depth discussion of asymptotic notation and we’ll be discussing divide-and-conquer algorithms (of which merge-sort is an example).

Tagged , ,

ALGORITHMS 101 – PART 1: BEGIN AT THE BEGINNING

So you’ve probably heard a great deal about algorithms in bioinformatics, or algorithms in general. At least, you’ve probably seen the word “algorithm” wantonly thrown around as you click through countless Google searches and forum posts. But what exactly is an algorithm? What are some classic examples implemented in real-world software and which ones are par for the course in Bioinformatics? (at least for common DNA sequencing analyses)

This article constitutes the first post in a series entitled “Algorithms 101” in which I’ll try to cover introductory algorithm analysis and explain several commonly used algorithms. As you’ll soon see, this is useful not only in general computer science but in understanding what about Bioinformatics is truly computationally intense and what isn’t because, as it turns out, it’s not all the work of super computers and sequencing centers.

I found understanding basic algorithms (along with a few other things like database design, which I’ll cover in another series) to be crucial in building a workable sequence analysis solution. Plus, having a few basic building blocks in place really helps map out the often overwhelming landscape of genomic Bioinformatics.

So, let’s begin at the beginning. An algorithm is a general, re-usable problem solving protocol. In science, we like protocols precisely for this reason; they can be implemented in a standardized way. Algorithms are very similar: if a user wants to use my program to sort a list of numbers, all I have to write is a standard method to sort a numerical list. What particular numbers she chooses to enter make no difference; my program just refers to the instructions that I wrote.

Imagine a computer as a tool that receives various instruction sets (programs) and performs them on user input. In order to be flexible in this way, a computer has to implement only basic instructions (like the four operations of a calculator), which can be combined in various ways to make complex programs. The basic instructions commonly found in real computers fall into three main categories: arithmetic (adding, subtracting, multiplying, dividing, remainder etc…), data manipulation (copy, load, store) and control (subroutine call, return, conditional and unconditional branch) and each one of these instructions takes a constant amount of time. In reality, while all basic instructions take constant time, they don’t all take the same constant, that is, some take longer than others but to keep things simple here, we’ll just make the assumption that all basic instructions take the same time to execute, and this time is the constant c.

As we’ll soon see though, operations that comprise combinations of these instructions do NOT take a constant amount of time, rather, their computational cost can be represented by what is called theta-notation (we’ll use it loosely now and get more specific in later posts) which is a general way to represent and compare the complexity of instruction sets. Basically, theta-notation denotes T(n) (that’s “t of n“) which is the overall running time of a program T as a function of the variable n where n is an input. In fact, theta-notation is a type of “asymptotic notation” which includes several functional notations for expressing efficiency (running-time) versus input size (n), we’ll talk about the others in later posts. So, With all this talk of functions, it should come as no surprise that this notation allows us to conveniently graph the running time of an algorithm, or graph more than one algorithm and compare them, like in Figure 1.

Figure 1

a graphical comparison of two algorithms, A and B

The great thing about theta-notation is that it allows us to plot computational cost over the range of all possible inputs. Even greater than that is the fact that an algorithm that is more efficient in theta-notation is more efficient EVEN IF it’s being run on your laptop and the less efficient alternative is being run on a super computer, pretty cool. So theta-notation, despite the daunting name, actually allows us to abstract away from a lot of the nitty gritty of system specifics and just see which algorithm—which problem solving method—works best, and not for a single input either, but over the entire range of possible inputs.

One other thing about theta-notation: mathematical representations of algorithms can be incredibly complex. Fortunately for us, we’re  only representing algorithms mathematically so that we can compare their overall function and graph that function. Since we’re just trying to get a gist for an algorithm’s behavior across a range of inputs from 1 to infinity, then we can ditch a lot of the math because it becomes irrelevant at this scale.

To say it more clearly, I have an algorithm that sorts a list of numbers through a 10-step protocol. Each step is a basic instruction and takes constant time, which we’ll represent by the variable c. So, if each instruction takes c amount of time and there are 10 steps needed to sort each number the user types in, then the running-time of this algorithm on an input of one number is 1*10*c or 10c. As the user enters more and more numbers, and remember n = input size, then the total running-time for any input n is n*10*c. In other words, this scales linearly with n.  This example has a good laboratory analogy, if I need to streak 10 agar plates, and I have to repeat the entire protocol for each plate, then the amount of time I’m going to have to spend streaking plates scales linearly with the number of plates: streaking 10 plates will take 10 times as long as streaking one plate.

As mentioned above, because this kind of mathematical representation covers all n’s from 1 to infinity, only the highest-order term matters. If, in the previous example, my algorithm took n*10*c + 6 time, I can drop that 6 because it’s insignificant as n scales to infinity. Furthermore, if my algorithm takes 2n² + 10n +6 time, then I can drop both the 10n and the 6 because they both become insignificant compared to n^2 as it scales exponentially with n.

It’s also worth mentioning as you can see from Figure 1, that when you compare the running times of two algorithms across a range of possible inputs n, it’s sometimes the case that one algorithm with perform more efficiently up to a point after which its alterative will perform better. This point, where the graphed lines cross over, is called n0 and if you can be sure that the range of inputs allowed for your program will fall above or below n0, then you can make an enlightened decision about which algorithm to choose in a comparison like this.

So that covers some basics of algorithm analysis and in the following posts of this series, we’ll get into some actual examples, beginning with two very simple algorithms: insertion-sort and merge-sort.

Tagged ,
Follow

Get every new post delivered to your Inbox.