Matrix approximation is a heavily studied subfield of data mining and machine learning. Many data analysis tasks rely on obtaining a low-rank approximation of a matrix; examples include dimensionality reduction, anomaly detection, data de-noising, clustering, and recommendation systems. In this article, we look into the problem of matrix approximation and how to compute it when the whole dataset is not available at hand.
The content of this article is partly taken from my lecture for the Stanford CS246 course. I hope you find it useful. Please find the full content here.
Most data generated on the web can be represented as a matrix in which each row is a data point. For example, in a router, every packet sent across the network is a data point that can be represented as a row in the matrix of all packets. In retail, every purchase made is a row in the matrix of all transactions.
At the same time, almost all data generated over the web is of a streaming nature, meaning it is generated by an external source at a rapid rate over which we have no control. Think of all the searches users issue on the Google search engine every second. We call this streaming data because, just like a stream, it pours in.
Some examples of typical streaming web-scale data are the following: packets passing through network routers, retail purchase transactions, and queries issued to a search engine.
Think of streaming data as a matrix A containing n rows in d-dimensional space, where typically n >> d. Often n is on the order of billions and growing.
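As a point of reference, here is a minimal sketch of what a low-rank approximation of such an n-by-d matrix looks like in the batch (non-streaming) setting, using NumPy's SVD. The matrix sizes and the target rank k are illustrative choices, not values from the article:

```python
import numpy as np

# Illustrative sizes: n data points in d-dimensional space, with n >> d
n, d, k = 10_000, 50, 5

rng = np.random.default_rng(0)
A = rng.standard_normal((n, d))  # stand-in for the full data matrix

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in decreasing order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top-k singular triplets to form the rank-k approximation.
# By the Eckart-Young theorem this is the best rank-k approximation of A
# in both the Frobenius and the spectral norm.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

approximation_error = np.linalg.norm(A - A_k, "fro")
```

Note that this baseline requires holding all n rows in memory and making multiple passes over them, which is exactly what the streaming setting rules out.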
In the streaming model, data arrives at high speed, one row at a time, and an algorithm must process each item as it arrives, or it is lost forever.
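One well-known way to approximate a matrix under this one-pass, bounded-memory constraint is a sketching algorithm such as Frequent Directions (Liberty, 2013). The sketch below is a minimal illustrative implementation of that idea, not necessarily the method this article develops; the sketch size `ell` and the toy data are assumptions for the example:

```python
import numpy as np

def frequent_directions(rows, d, ell):
    """Maintain an ell x d sketch B of a stream of d-dimensional rows,
    so that B.T @ B approximates A.T @ A while using only O(ell * d)
    memory, regardless of how many rows the stream contains."""
    B = np.zeros((ell, d))
    for row in rows:
        zero_rows = np.where(~B.any(axis=1))[0]
        if zero_rows.size == 0:
            # Sketch is full: shrink its singular values to free up rows
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[ell // 2] ** 2
            s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
            B = s_shrunk[:, None] * Vt
            zero_rows = np.where(~B.any(axis=1))[0]
        # Insert the incoming row into a freed-up slot of the sketch
        B[zero_rows[0]] = row
    return B

# Toy stream: rows are consumed one at a time, never stored in full
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10))
B = frequent_directions(iter(A), d=10, ell=6)
covariance_error = np.linalg.norm(A.T @ A - B.T @ B, 2)
```

The key property is that the memory footprint depends only on `ell` and `d`, not on the number of rows seen, which is what makes this style of algorithm viable when n is in the billions.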