I saw this on someone’s interview question list. When I started looking into it, I found it becoming more and more challenging and interesting.
After reading several papers on this topic, I have a better idea of how we might solve it and what issues we need to consider.
There are two categories of approaches to tackling this problem: one is counter-based, the other is sketch-based.
Now I’m going to try explaining some of the algorithms that I liked and that have also proven to be efficient according to the papers.
Algorithm 1: the Frequent algorithm (paper: http://erikdemaine.org/papers/NetworkStats_ESA2002/paper.pdf).
This algorithm extends the classic majority algorithm. The classic majority algorithm finds the element that occurs more than n/2 times in an array of size n using a single counter, monitoring only one element at a time. Each occurrence of the monitored element increases the counter by one, while an occurrence of a different element cancels one out and decreases the counter by one. When the counter reaches zero, we assign it to the next element seen. This way, if a majority element exists, it will be the one being monitored at the end. We still need a second pass to verify that it is truly a majority element.
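The two passes above can be sketched in a few lines of Python (a minimal sketch; the function and variable names are my own, not from the paper):

```python
def majority_element(items):
    """Return the element occurring more than n/2 times, or None if none exists."""
    candidate, count = None, 0
    # First pass: matching elements increment the counter, differing
    # elements cancel one out; a true majority survives to the end.
    for x in items:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # Second pass: verify the surviving candidate really is a majority.
    if candidate is not None and sum(1 for x in items if x == candidate) > len(items) // 2:
        return candidate
    return None
```

For example, `majority_element([1, 2, 1, 1, 3, 1])` returns `1`, while `majority_element([1, 2, 3])` returns `None` because no element occurs more than 3/2 times — the second pass is what catches that case.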
The Frequent algorithm uses m counters monitoring m elements and returns m candidates, which will include every element that occurs more than n/m times, where n is the size of the data. The basic idea: when a new element comes in, we assign it a counter if any is free; otherwise, we check whether it is already among the m monitored elements. If it is, we increment its counter by one; if it is not, we decrement every counter by one and discard the new element. Any counter that reaches zero frees its slot, which a future element can then claim.
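The steps above can be sketched as follows (my own illustrative code, returning the candidate set only — a second pass over the data would be needed to verify the true counts):

```python
def frequent(items, m):
    """Frequent (Misra-Gries style) algorithm with m counters.

    Returns a dict of candidate elements; it is guaranteed to contain
    every element occurring more than n/m times, possibly with some
    false positives that a second pass can filter out.
    """
    counters = {}
    for x in items:
        if x in counters:
            # Already monitored: bump its counter.
            counters[x] += 1
        elif len(counters) < m:
            # A free counter is available: start monitoring x.
            counters[x] = 1
        else:
            # No free counter and x is not monitored: decrement all
            # counters; any that reach zero free their slot for
            # future elements. x itself is discarded.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For example, `frequent([1, 1, 1, 2, 2, 3], 2)` ends with `1` and `2` as the monitored candidates; note the surviving counter values are lower bounds gone stale from decrements, not exact frequencies, which is why the second pass matters.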
To be continued…