The CQL Continuous Query Language: Semantic Foundations and Query Execution
- https://mwhittaker.github.io/papers/html/arasu2006cql.html
- Overview
- A database management system (DBMS) allows user to write ad-hoc queries against a static (or slowly changing) database. A data stream management system (DSMS) allows users to register stream queries which continuously run against streamed inputs.
- Streams and Relations
- We assume an ordered time domain $\mathcal{T}$.
- A stream $S$ is a multiset with entries of the form $(s, \tau)$ where $s$ is a tuple and $\tau \in \mathcal{T}$. We assume that there is a finite number of tuples per timestep.
- A relation $R$ is a function from the time domain $\mathcal{T}$ to a multiset of tuples (i.e. to a standard relation). $R(\tau)$ is the relation at time $\tau$.
- Abstract Semantics
- The abstract semantics is based on stream-to-relation, relation-to-relation, and relation-to-stream operators.
- Stream-to-relation
- Time-based windows:
S[Range T]
is a relation with all the tuples within the time range T
.
- Tuple-based windows:
S[Rows n]
is a relation with the n
most recent tuples; ties are broken arbitrarily.
- Partitioned windows:
S[Partition By A1, ..., Ak Rows n]
is a relation with the n
most recent tuples for every group of A1, ..., Ak
.
- Relation-to-stream
Istream
$(R)$ at time $\tau$ contains $R(\tau) - R(\tau - 1)$. Used mostly with unbounded ranges.
Dstream
$(R)$ at time $\tau$ contains $R(\tau - 1) - R(\tau)$.
Rstream
$(R)$ at time $\tau$ contains $R(\tau)$. Used mostly with now ranges.
- Istream and Dstream can be implemented using Rstream.
- By default, unbounded ranges are added, and Istreams are added to monotonic queries.
- See paper for example queries.
- Time Management
- In order to process a query at time $\tau$, we have to know that there are no more inbound tuples with time $\leq \tau$. To do so, input sources use heartbeats (low watermark). A heartbeat at time $\tau$ indicates that no more inputs with time $\leq \tau$ will be sent. There are three mechanisms for heartbeats:
- If timestamps are generated by a centralized DSMS, then calculating a low watermark is simple.
- If input sources deliver tuples in increasing timestamp order, then we can collect heartbeats from the sources and use the minimum heartbeat as a low watermark.
- If we have a global clock and bounded message delay, then we can infer the low watermark. For example, if it's 1:42 and the message delay is 1 minute, then we know that we have all messages from 1:41.
- Equivalences in CQL
- All relation-to-relation query optimizations and materialized view optimizations can be leveraged.
- Window Reduction:
SELECT Istream(L) FROM S[Range Unbounded] WHERE C
is equivalent to SELECT Rstream(L) FROM S[Now] WHERE C
.
- Filter-Window Commutativity:
(SELECT L FROM S WHERE C) [Range T]
is equivalent to SELECT L FROM S[Range T] WHERE C
. Note the query is a simple SELECT-FROM-WHERE query.
- CQL Implementation in STREAM
- Streams and relations are represented as a stream of triples $(s, \tau, \text{insert} | \text{delete})$ in nondecreasing timestamp order. Streams are represented as a sequence of insertions. Relations are represented as a sequence of insertions and deletions.
- A query plan is a graph in which vertices are operators and edges are queues (in memory). Operators read streams and relations (in their unified format) from their input queues and output a stream or relation on their output queue.
- Every operator has a corresponding synopsis (in memory) in which it can maintain its state. For example, a sliding window join may create a couple of hash tables.
- Tuple are not copied when possible. Instead, they are stored in synopses, and tuple references are passed around.
- Multiple queries can (and should) be combined into a single query plan.
- A global scheduler decides how to evaluate the query plan graphs.
- STREAM (a DSMS that implements CQL) has physical operators for all of the CQL operators and a couple of lower-level system operators.
- Questions
- Q: Give an example stream query that has ambiguous semantics and explain how CQL makes the semantics clear.
- A: ???
- Q: What is the relationship between stream processing and materialized views?
- A: The specification of a materialized view is a lot like a stream query. The materialized view is updated over time as the base tables change. Stream processing systems also use a lot of the same implementation strategies developed for materialized views.
- Q: CQL includes both streams and relations. Can we remove relations and maintain the same expressiveness?
- A: Yes. You can encode relations using streams. However, many concepts are more naturally expressed as time varying relations than as a stream of values.
- Q: How can we implement Istream and Dstream as an Rstream?
- A: ???
- Q: Why is a SQL query plan a tree but a CQL query plan a graph?
- A: ???
- Q: How do you implement the stream-based relation-to-relation operators?
- A: ???