R@M3$H.NBlog

Store a stream of numbers along with the counts

20 January, 2014 - 3 min read

Given a stream of numbers, suggest a data structure to store the numbers in the stream along with their number of occurrences.
Solution: Before solving any problem make sure you lay down all the assumptions you are making and validate them with the interviewer.
Assumptions here:

  • The numbers are 32 bit integers.
  • The size of the set (cardinality) is very small compared to the possible number of integers (lets say 100 numbers)
  • Design for faster lookup

Since the size of the set is small, a data structure like an array of integers 2^32 in size will be a waste of space, even though the insert and lookup is going to be O(1). A hashtable would have a similar issue. A linked list would solve the memory issue by storing only the numbers that were present in the set but the look up suffers ( O(n) ). We would like the look up to be better than O(n).
So we've looked at an array, linked list, hashtable and none seem to solve it. Linked list looked promising but look up was O(n). Can we improve the lookup in a linked list? First of all, whats better than O(n)? O(log n), right? What data structure does that remind you of? A Binary Search Tree!!
Lets explore if we can utilize a Binary Search Tree here. So, if we started inserting numbers in a Binary Search Tree, each insertion would cost us O(log n). There are n number of insertions, so inserting all the numbers is going to be O(n log n). When we insert each number we also store the count at the node. So if a number is inserted for the first time we create a node in the tree and store the number itself and also the count (1 for the 1st occurrence). Any subsequent occurrence will not create a new node. Instead, we'll find the corresponding node and just update the count. So by the time we are done processing all the numbers we'll have a nice binary search tree with all the numbers and their occurrences.
Now the comes the look up part. Look up in a Binary Search Tree is O(log n). Any time we are presented a number and asked for the number of occurrences we can look up (or not ) the number and get its count.
If we were to design for faster insert we'd be better off using an unsorted linked list. Insert would have been O(1) (just insert at the end) and look up would be O(n). So for a small n this wouldn't be terribly bad.
Note: There is another data structure called the skip list which is a layered linked list type data structure which has O(log n) look up and insert. The linked list in the skip list is a sorted list and that would make our insert O(n log n).
Question 2:
Find Duplicates in a million number