In Part 1 of this article, we discussed how real-time computing systems have made inroads into our personal and professional lives. We also discussed how these systems have proven to be game changers for many businesses, and how their growing adoption positions them as the backbone of a robust information infrastructure within enterprise organizations.
In this second and final part of the series, we will discuss the design patterns, architectures and technologies available to design, build and sustain real-time computing systems within the context of enterprise organizations.
Before discussing design and technical architecture, let’s take a quick look at the most common components of a real-time/stream processing system. This sets the context for the rest of the discussion.
The first component handles data acquisition. Depending on the type of data source (website, machine sensors, mobile devices, etc.), this component either pulls the data or has data pushed to it by the source system. Acquisition can happen at an ultra-low-latency frequency as new data becomes available.
Once data is acquired, depending on the design requirements, it needs to be ingested into a computing system or processing framework for further processing. The nature of that processing also determines several architectural, performance and resource needs. Processing can be as simple as integrating an incoming stream with another dataset, or as complex as running a machine learning algorithm on the incoming data.
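The acquisition-to-ingestion handoff above can be sketched in a few lines of Python. This is a minimal, illustrative sketch only, not a production ingestion framework: a source thread pushes records into a bounded buffer, and an ingestion loop pulls them out for downstream processing. All names here (`acquire`, `ingest`, `ingest_buffer`) are my own.

```python
import queue
import threading

# Bounded buffer between acquisition and ingestion; the bound applies
# back-pressure when the consumer falls behind the source.
ingest_buffer = queue.Queue(maxsize=1000)

def acquire(records):
    """Simulate a source system pushing records as they become available."""
    for record in records:
        ingest_buffer.put(record)   # blocks if the buffer is full
    ingest_buffer.put(None)         # sentinel: no more data

def ingest(process):
    """Pull records from the buffer and hand each one to a processing step."""
    results = []
    while True:
        record = ingest_buffer.get()
        if record is None:
            break
        results.append(process(record))
    return results

# Usage: acquire simulated sensor readings in one thread, ingest in the main thread.
readings = [{"sensor": i, "value": i * 1.5} for i in range(5)]
producer = threading.Thread(target=acquire, args=(readings,))
producer.start()
processed = ingest(lambda r: {**r, "value_doubled": r["value"] * 2})
producer.join()
```

In a real deployment this buffer role is played by a durable system such as Kafka rather than an in-process queue, but the pull/push decoupling is the same idea.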
Commonly known as the “external lookup” component, this piece may be required if alerting or decision making on the data stream is part of the design. Alerting or decision making may require reference data that must be looked up to establish the decision context. While such a lookup enriches the real-time system, it also poses some design and architectural challenges.
This is where, if needed, data processing takes place. Processing in this context could mean filtering, integration, cleansing, calculation or running certain kinds of algorithms.
One important thing to keep in mind is that several real-time/streaming systems do no processing at all; instead, they act as a data exchange between two endpoints, such as two different machine sensors.
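The processing steps mentioned above (filtering, cleansing, calculation) compose naturally as record-level stages over a stream. Here is a hedged, stdlib-only sketch; the stage names and field names (`temp_c`, `temp_f`) are invented for illustration.

```python
# Each stage is a plain function over an iterable of records; a stream is
# just an iterable, so stages chain lazily without buffering everything.

def filter_stage(records, predicate):
    """Drop records that fail a business-rule predicate."""
    return (r for r in records if predicate(r))

def cleanse_stage(records):
    """Drop malformed records and normalize field types."""
    for r in records:
        if "temp_c" in r:
            yield {"sensor": r.get("sensor", "unknown"), "temp_c": float(r["temp_c"])}

def calculate_stage(records):
    """Derive a new field from each cleansed record."""
    for r in records:
        yield {**r, "temp_f": r["temp_c"] * 9 / 5 + 32}

raw = [
    {"sensor": "a", "temp_c": "20"},
    {"sensor": "b", "bad": True, "temp_c": "0"},  # flagged record
    {"sensor": "c", "temp_c": "30"},
]
pipeline = calculate_stage(cleanse_stage(filter_stage(raw, lambda r: not r.get("bad"))))
results = list(pipeline)
```

Frameworks like Storm and Spark Streaming express the same idea as bolts or transformations distributed across a cluster.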
Finally, once processing is done, the processed data is delivered to consumption channels. These can be a web app, a dashboard, another machine, mobile devices, other internal or external systems, or a stored data view that data analysts can query further in an ad-hoc manner.
These systems are referred to as “real-time” systems in a very generic sense. While designing such systems, it is important to put the term in a meaningful perspective: real time may mean different things in different business contexts and to different organizations and people.
In most cases, true real-time systems involve machine-to-machine data exchange, where the decision-making intelligence resides with either the data-sending endpoint or the data-receiving endpoint.
Most other systems, where a stream of data needs to go through additional processing steps, are considered “near real time” (NRT) or ultra-low-latency systems.
Also, several real-time systems use the concept of “micro-batches,” where data is collected in small batches and then processed.
Micro-batching is also a design technique used by architects to improve the efficiency of real-time processing systems.
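A minimal sketch of the micro-batching idea, assuming a simple flush policy of my own choosing: records are buffered and emitted when either the batch reaches a maximum size or a maximum wait time elapses, trading a small amount of latency for more efficient per-batch processing.

```python
import time

def micro_batches(records, max_size=3, max_wait=0.5, clock=time.monotonic):
    """Group a stream of records into batches bounded by size or elapsed time."""
    batch, deadline = [], clock() + max_wait
    for record in records:
        batch.append(record)
        if len(batch) >= max_size or clock() >= deadline:
            yield batch
            batch, deadline = [], clock() + max_wait
    if batch:
        yield batch  # flush the final partial batch

# Usage: seven records with a batch size of three.
batches = list(micro_batches(range(7), max_size=3))
# batches -> [[0, 1, 2], [3, 4, 5], [6]]
```

Spark Streaming is built around exactly this model, with the batch interval as a first-class configuration knob.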
Before we dive into actual architectural patterns and design, let’s take a look at some of the technologies that are available to us to design, build and sustain these systems.
An important aspect of such systems is the actual processing ecosystem. Because the nature of processing varies and processing loads fluctuate, a distributed computing system such as Apache Hadoop is helpful. The technologies discussed below work well with Apache Hadoop.
The beauty of the Apache Hadoop ecosystem is that it provides various options to support multiple use cases. While technologies like Apache Flume and Apache Kafka, used independently or together, provide a robust data acquisition and ingestion layer, systems like Apache Storm and Apache Spark (Spark Streaming, in particular) provide robust abilities to process incoming data, perform record-level enrichment, windowed computations and data splits. Spark also provides the ability to run machine learning on top of an incoming data stream.
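To make the windowed-computation idea concrete without requiring a Spark cluster, here is an illustrative stdlib-only sketch of a count-based sliding-window aggregation; the function name and window policy are my own, not a Spark API.

```python
from collections import deque

def sliding_window_avg(stream, window_size=3):
    """Yield the running average over the last `window_size` values."""
    window = deque(maxlen=window_size)  # oldest value falls off automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# Usage: smooth a stream of sensor readings.
readings = [10, 20, 30, 40, 50]
averages = list(sliding_window_avg(readings, window_size=3))
# averages -> [10.0, 15.0, 20.0, 30.0, 40.0]
```

Spark Streaming offers the same concept as time-based window operators (e.g. windowed reductions over a DStream), computed in parallel across the cluster.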
Storage systems such as HDFS can be used to persist the data, and NoSQL systems such as Apache HBase provide ultra-low-latency data access.
One of the common use cases for real-time processing is indexing the incoming data stream for active search. Apache Solr provides this functionality within the ecosystem.
The ability to ingest and persist data in the same cluster, and to support multiple use cases and access patterns, makes the Apache Hadoop ecosystem one of the best choices for deploying real-time systems.
Let’s discuss a couple of design patterns as they relate to real-time/streaming and near real-time processing systems.
There are many other design patterns that can be implemented while building such systems. In the interest of time, I will only discuss two of them, but if you would like to discuss others in the comments below, I would be more than happy to share my thoughts.
Also, keep in mind that several new patterns are still evolving as real-time processing design becomes the core of Internet of Things (IoT)-based systems.
The first pattern, a straightforward data exchange between two endpoints, may look and sound simple, but it requires a carefully thought-out design.
Such a pattern is most useful when processing resides within the endpoints themselves, such as the machine or device sending the data or the machine or device receiving it.
Data sent from endpoint “A” could be the result of an action that endpoint “B” needs for further processing. It could also be a health or wellness status of the devices.
The second pattern applies mostly to systems that need to do more than just exchange data between two endpoints. Such systems can be used to identify potential fraud, generate alerts based on business rules or take pre-determined actions based on a set of variables.
Such systems require some kind of reference data that is available only outside the context (the actual data pipeline) in which they operate, which is why the lookup is called an “external lookup.”
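The external-lookup step can be sketched as follows. This is a hedged illustration under my own assumptions: `fetch_customer_tier` and `REFERENCE_STORE` are hypothetical stand-ins for a real reference-data store (such as HBase), and the cache is one common answer to the latency challenge such a lookup introduces.

```python
import functools

# Stand-in for an external reference-data system outside the pipeline.
REFERENCE_STORE = {"c1": "gold", "c2": "silver"}

@functools.lru_cache(maxsize=10_000)
def fetch_customer_tier(customer_id):
    """Cached lookup, so the external call is not a per-record bottleneck."""
    return REFERENCE_STORE.get(customer_id, "unknown")

def enrich_and_alert(events, threshold=1000):
    """Enrich each streaming event with reference data, then apply a rule."""
    for event in events:
        tier = fetch_customer_tier(event["customer_id"])
        alert = event["amount"] > threshold and tier != "gold"
        yield {**event, "tier": tier, "alert": alert}

# Usage: flag large transactions from non-gold customers.
events = [
    {"customer_id": "c1", "amount": 5000},
    {"customer_id": "c3", "amount": 2000},
]
flagged = [e for e in enrich_and_alert(events) if e["alert"]]
```

The design trade-off hinted at in the text shows up here: caching keeps the lookup fast, but the cached reference data can go stale, so cache size and refresh policy become architectural decisions.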
This design pattern can be treated as a framework, and it can be very powerful. It can also support several other capabilities, such as executing algorithms on streaming data, performing window-based computation and aggregation, and sending commands to various systems instead of data or alerts.
This article is my small effort to cover some basic architectural considerations and design patterns. As you can see, the possibilities are limitless, and the available technologies provide the robust platform required to unleash the power of creativity.
In my view, in the years to come, real-time and near-real-time systems will become more common within the enterprise. Established rules, such as processing most data in batch mode, will be challenged as technology platforms become more powerful and sustainable.
Cloud computing and IoT-based systems have become catalysts in realizing this vision.
It is only a matter of time before more and more corporations realize the profound value that real-time, streaming and IoT systems bring to their business models.