Algorithms is my favorite class at Berkeley so far. And if I had to choose a favorite algorithm, it would be the Boyer-Moore majority voting algorithm. It’s a streaming algorithm that finds the majority element(which is the element that appears more than half of the time in a list) in linear time and constant space. If it’s guaranteed that that there is a majority element, it only requires one pass, making it a “true” streaming algorithm. However, if you do not know if a majority element exists, then it requires two passes: one to determine a possible majority element candidate, and one to determine whether or not the candidate is actually the majority element. Let’s talk about why it would seem “impossible” to find the majority element in linear time and constant space. It seems very strange because the naïve idea is to keep a count of all of the elements you see and take the max at the end, but that would require linear space. To have constant space, what would you even store as variables? Mergesort can be done in constant space, but that takes O(n log n). How can we keep track of the count of all of the variables to see which one is the most while always using the same amount of space?
How the algorithm works: You keep track of two variables: 1) The name of the current possible candidate(CPC) and 2) A counter initialized to 0 that increases/decreases depending on the element you currently stream. Every time you see an element, if it is the same name as the name of your CPC, you increase the counter by 1. If it isn’t, you decrease the counter by 1. If at anytime the counter reaches 0, you set the counter to 1 and change the CPC to the name of the current variable in the stream. And that’s it. If you don’t know if a majority element exists, on the second pass, you just have to count the amount of times the CPC is in the stream to see if it is actually the majority element.
Why do I like this algorithm? First, like I said above, it seems almost like an impossible problem. Linear time and constant space sounds too good to be true for a complex sounding problem. Second, the solution is simple. I would almost saying looking at the code for the solution is simpler than reading a description of the algorithm. Also, it was so simple that I couldn’t believe it worked the first five times I looked at the algorithm. It didn’t seem like it SHOULD work.
So why does it work, how come you only have to keep track of one variable name and not all of them? The key is that the majority element is not only the most repeated element, but must be present in more than half of the stream(there is a difference). From the perspective of the majority element, it will at most only subtract from the counter half of the times, which means it must end up at the end as CPC since it is repeated more than half of the times.
But what about all of the other elements? What if we are counting the other elements as the CPC up and down(that was a little confusing. As in the counter goes up and down but the variable name is not the majority element)? If you think about it, it’s even better for you if there are more elements that the counter goes up and down that isn’t the majority element. They are cancelling each other out. I like to describe it as each element in the stream is part of an army that consists of all of the elements of the same name. And every soldier has equal fighting strength that when they fight each other they both die. The majority element is the army that has at least one soldier at the end of the war. If you are part of the majority element army, you don’t care if two other armies fight each other. That’s better for you that they cancel each other out. That’s why this algorithm works even though you keep track of only one variable name. Intuitively, it seems like you need to keep track of all of the different element names and there are hundreds of armies. But even more intuitively, it would be harder to tell what the majority element is if there are only two armies that almost have the same amount. In other scenarios, you could almost just look at the list and see immediately. But with just two armies with relatively close amounts, it’s not easy to eyeball it.
I’m not sure if that made much sense. But you can read about it on Wikipedia if it doesn’t. It took me a long time to understand the algorithm and why it works.
Streaming algorithms are my favorite algorithms in general. I just like the idea that when you stream data, it could take on so many different forms, yet you only have to look at each piece of data once and store it in constant space and you can find the solution. And the solution is usually pretty simple. And proof of correctness is always more fun since you can explain why your simple solution solves a complex problem.
Maybe it’s a good life lesson. Sometimes when you’re dealing with a complex problem, the solution isn’t always to overanalyze all the details multiple times. Sometimes going through the details only once is good enough.