At the time of writing, data processing frameworks such as MapReduce and its “cousins” like Hadoop, Pig, Hive, or Spark allowed data consumers to batch-process data at scale. On the stream processing side, tools like MillWheel, Spark Streaming, or Storm emerged to serve those users. Still, these existing models fell short in some common use cases.
Consider an example: a video streaming provider's revenue comes from billing advertisers for the amount of advertising viewed with its content. The provider wants to know how much to bill each advertiser each day and to aggregate statistics about the videos and ads. Additionally, it wants to run offline experiments over large amounts of historical data, and to learn how often and for how long its videos are watched, with which content/ads, and by which demographics. All of this information must be available quickly so the business can adjust in near real time. The processing system must also be simple and flexible enough to adapt to the complexity of the business, and it must handle data at a global scale, since the Internet lets businesses reach more customers than ever before. Here are some observations from the people at Google about the state of data processing systems at that time:
- Batch systems like MapReduce, FlumeJava (a Google-internal technology), and Spark cannot guarantee latency SLAs, since they must wait for all input data to be collected into a batch before processing it.
- Streaming systems that provide scalability and fault tolerance fall short on expressiveness or correctness.
- Many cannot provide exactly-once semantics, which affects correctness.
- Others lack the primitives necessary for windowing, or provide window semantics limited to tuple-based or processing-time-based windows (e.g., Spark Streaming).
- Most of those that provide event-time-based windows either rely on ordering or offer limited window triggering.
- MillWheel and Spark Streaming are sufficiently scalable, fault-tolerant, and low-latency, but they lack high-level programming models.
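As a concrete illustration of the gap those systems left, here is a per-user, event-time session computation, the kind of workload the paper calls out as hard to express. This is a minimal sketch using the Java SDK of Apache Beam, the open-source descendant of the model described here; the input collection and the 30-minute session gap are illustrative assumptions, not anything prescribed by the paper.

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class EventTimeSessions {
  // Groups each user's view events into sessions: bursts of activity
  // separated by 30+ minutes of event-time inactivity. Because grouping is
  // driven by the elements' event-time timestamps, records that arrive out
  // of order still land in the correct session.
  static PCollection<KV<String, Integer>> sessionize(
      PCollection<KV<String, Integer>> viewsPerUser) {
    return viewsPerUser
        .apply(Window.<KV<String, Integer>>into(
            Sessions.withGapDuration(Duration.standardMinutes(30))))
        .apply(Sum.integersPerKey());
  }
}
```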
They conclude that the main weakness of all the models and systems mentioned above is the assumption that unbounded input data will eventually be complete. That assumption no longer makes sense when we face the reality of today's huge and highly disordered data. They also believe that any approach to handling diverse real-time workloads must provide simple yet powerful interfaces for balancing correctness, latency, and cost for the specific use case. From that perspective, the paper makes the following conceptual contributions to a unified stream processing model:
- Allows event-time-ordered results (reflecting when events actually occurred) to be computed over an unbounded, unordered data source, with configurable combinations of correctness, latency, and cost.
- Decomposes pipeline implementation into four related dimensions (illustrated in the sketch after this list):
– What results are being computed,
– Where in event time they are being computed,
– When in processing time they are materialized,
– How earlier results relate to later refinements.
- Separates the logical abstraction of data processing from the underlying physical implementation layer, allowing users to choose the processing engine.
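To make the four dimensions concrete, here is how each question maps onto one part of a pipeline, again as a hedged sketch in Apache Beam's Java SDK, reusing the advertiser-billing example from above. The daily window, the ten-minute early firings, and the one-day allowed lateness are made-up settings for illustration.

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class DailyAdBilling {
  static PCollection<KV<String, Long>> dailyBillPerAdvertiser(
      PCollection<KV<String, Long>> adViewCosts) {
    return adViewCosts
        .apply(Window.<KV<String, Long>>into(
                // Where in event time: one-day windows over when views occurred.
                FixedWindows.of(Duration.standardDays(1)))
            // When in processing time: a speculative result ten minutes after
            // the first record, an on-time result once the watermark passes
            // the end of the window, then one update per late record.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(10)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardDays(1))
            // How refinements relate: each firing accumulates into the pane,
            // so every new result supersedes the previous, smaller total.
            .accumulatingFiredPanes())
        // What is computed: the total billed amount per advertiser.
        .apply(Sum.longsPerKey());
  }
}
```

Note how correctness, latency, and cost are tuned here purely through the trigger and accumulation settings; the "what" (a per-key sum) never changes.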
In the rest of this blog, we will see how Google delivers on these contributions. One last thing before we move on to the next section: Google noted that “there is nothing magical about this model.” The model does not suddenly make your computationally expensive tasks run faster; it provides a general framework that allows parallel computation to be expressed simply, without being tied to any specific execution engine such as Spark or Flink.
The paper's authors use the terms unbounded/bounded for infinite/finite data. They avoid the terms streaming/batch because those usually imply the use of a specific execution engine. Unbounded data describes data with no predefined limit, for example, user interaction events of an active e-commerce application: the flow of data only stops when the application is no longer active. Bounded data refers to data with clear start and end boundaries, for example, a daily export from the operational database.
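In code, the distinction shows up only at the source; the programming model on top stays the same. A minimal Beam sketch (the export file path and the Pub/Sub subscription below are made-up placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

public class BoundedVsUnbounded {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Bounded: a finite daily export with a clear start and end; the
    // pipeline can read all of it and then finish.
    PCollection<String> bounded =
        p.apply(TextIO.read().from("/exports/orders-2015-08-01.csv"));

    // Unbounded: a subscription that keeps delivering events for as long
    // as the application is active; "all the data" never arrives.
    PCollection<String> unbounded =
        p.apply(PubsubIO.readStrings().fromSubscription(
            "projects/my-project/subscriptions/user-events"));

    p.run().waitUntilFinish();
  }
}
```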