The Impact of Real-time Computing Systems - Part 2

In Part 1 of this article, we discussed how real-time computing systems have made inroads into our personal and professional lives. We also discussed how these systems are proving to be a game changer for many businesses, and how their growing adoption positions them as the backbone of a robust information infrastructure within enterprise organizations.

In this second and final part of the series, we will discuss the design patterns, architectures and technologies available to design, build and sustain real-time computing systems within enterprise organizations.

Components of a Typical Real-time/Stream Processing System

Before discussing the design and technical architecture, let’s take a quick look at the most common design components of a real-time/stream processing system. This sets the context for the rest of the discussion.

A. Acquisition Layer

Depending on the type of data source (website, machine sensors, mobile devices, etc.), this component either pulls the data from the source system or receives data that the source system pushes to it. Either way, data can be acquired with ultra-low latency as it becomes available.

B. Ingestion Component

Once data is acquired, it needs to be ingested into a computing system or processing framework for further processing, as the design requires. The nature of that processing determines several architectural, performance and resource needs. Processing can be as simple as integrating an incoming stream with another dataset or as complex as running a machine learning algorithm on the incoming data.

C. Lookup Component

Commonly known as the “external lookup” component, this may be required if alerting or decision making on the data stream is part of the design. Alerting or decision making may require reference data that must be looked up to establish the decision context. While such a lookup enriches the real-time system, it also poses design and architectural challenges.

D. Processing Component

This is where data processing takes place, if any is needed. Processing in this context could be filtering, integration, cleansing, calculation or running certain kinds of algorithms.

One important thing to keep in mind is that several real-time/streaming systems do no processing at all; instead, they act as a data exchange component between two endpoints, such as two different machine sensors.

E. Delivery Component

Finally, once processing is done, the results are delivered to consumption channels. These can be a web app, a dashboard, another machine, mobile devices or other internal or external systems, as well as a stored data view that data analysts can use for ad hoc analysis.
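
To make the five components concrete, here is a minimal Python sketch that chains them together as plain generator functions. It is an illustration only, not a reference implementation: the event fields (sensor_id, value), the per-sensor "limit" reference data and all function names are assumptions of mine, and a real system would replace each stage with a distributed technology.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional

Event = Dict[str, Any]  # hypothetical event shape: {"sensor_id": ..., "value": ...}


def acquire(source: Iterable[Event]) -> Iterator[Event]:
    """Acquisition: pull events from the source as they become available."""
    for event in source:
        yield event


def ingest(events: Iterable[Event]) -> Iterator[Event]:
    """Ingestion: pass only well-formed events on to the processing stages."""
    for event in events:
        if "sensor_id" in event and "value" in event:
            yield event


def lookup(events: Iterable[Event],
           reference: Dict[str, float]) -> Iterator[Event]:
    """Lookup: enrich each event with reference data (None if unknown)."""
    for event in events:
        limit: Optional[float] = reference.get(event["sensor_id"])
        yield {**event, "limit": limit}


def process(events: Iterable[Event]) -> Iterator[Event]:
    """Processing: flag events whose value exceeds the looked-up limit."""
    for event in events:
        limit = event["limit"]
        alert = limit is not None and event["value"] > limit
        yield {**event, "alert": alert}


def deliver(events: Iterable[Event], sink: List[Event]) -> None:
    """Delivery: hand results to a consumer (here, an in-memory list)."""
    for event in events:
        sink.append(event)


def run_pipeline(source: Iterable[Event],
                 reference: Dict[str, float]) -> List[Event]:
    sink: List[Event] = []
    deliver(process(lookup(ingest(acquire(source)), reference)), sink)
    return sink
```

Because each stage is a generator, events flow through one at a time rather than being collected first, which mirrors the streaming behavior described above.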

Is It Always Real Time in the True Sense?

These systems are referred to as “real-time” systems in a very generic sense. While designing such systems, it is important to put this term in a meaningful perspective. Real time may mean different things to different business contexts, organizations and people.

In most cases, true real-time systems are about machine-to-machine data exchange, where the decision-making intelligence resides with either the sending endpoint or the receiving endpoint.

Most other systems, where a stream of data needs to go through additional processing steps, are considered “near real time” (NRT) or ultra-low latency systems.

Also, several real-time systems use the concept of “micro-batches”, where data is collected in small batches and then processed.

Micro-batching is also a design technique used by architects to improve the efficiency of real-time processing systems.
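
As a rough illustration of micro-batching, the Python sketch below groups an incoming stream into small lists of at most max_size items. The function name and the size-only flush rule are my own simplifications; production systems typically also flush a batch after a time interval so that a slow stream does not hold data back.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], max_size: int) -> Iterator[List[T]]:
    """Group a stream into batches of at most max_size items.

    Hypothetical helper for illustration: it flushes on size only.
    """
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    batch: List[T] = []
    for item in stream:
        batch.append(item)
        if len(batch) == max_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        yield batch


for batch in micro_batches(range(7), max_size=3):
    print(batch)
```

The efficiency gain comes from paying per-call costs (a network round trip, a database write, a model invocation) once per batch instead of once per record, at the price of a small added delay.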

Technology Landscape

Before we dive into actual architectural patterns and design, let’s take a look at some of the technologies that are available to us to design, build and sustain these systems.

An important aspect of such systems is the processing ecosystem itself. Because the nature of processing varies and processing loads can go up and down, a distributed computing system such as Apache Hadoop is helpful. The technologies discussed below work well with Apache Hadoop.

 

The beauty of the Apache Hadoop ecosystem is that it provides options to support multiple use cases. Technologies like Apache Flume and Apache Kafka, used independently or together, provide a robust data acquisition and ingestion system, while Apache Storm and Apache Spark (Spark Streaming, in particular) provide robust abilities to process incoming data and perform record-level enrichment, windowing computations and data splits. Spark also provides the ability to run machine learning on the incoming data stream.

Storage systems such as HDFS can be used to store the data, and NoSQL systems such as Apache HBase can store data for ultra-low latency access.

One of the common use cases for real-time processing is the indexing of the incoming data stream for active search. Apache Solr provides this functionality within the ecosystem.

The ability to ingest and persist data in the same cluster and to support multiple use cases and access patterns makes the Apache Hadoop ecosystem one of the best choices for deploying real-time systems.

Design Patterns

Let’s discuss a couple of design patterns as they relate to real-time/streaming and near real-time processing systems.

There are many other design patterns that can be implemented while building such systems. In the interest of time, I will discuss only two of them, but if you would like to discuss others in the comments below, I would be more than happy to share my thoughts on those.

Also, keep in mind that new patterns are still evolving as real-time processing design becomes the core of Internet of Things (IoT)-based systems.

A. Design Pattern 1 – Pass-through data exchange between two endpoints

This design pattern, even though it looks and sounds simple, requires a carefully thought-out design.

Such a pattern is most useful when processing is located within the endpoints, such as the machine or device sending the data or the machine or device receiving it.

Data sent from endpoint “A” could be the result of an action that endpoint “B” needs for further processing. It could also be the health or wellness status of a device.

A few important considerations for such a design:

  1. Speed of processing must be looked at carefully. Technology selection and processing design should ensure the best possible speed.
  2. The design must ensure that no data is lost in processing. If data loss does occur, the design must ensure that an appropriate signal is sent to the impacted endpoints.
  3. The design should ensure that data collection is decoupled from the actual data exchange or processing method, and that collected data remains available for re-processing for some period of time. This is one of the reasons why distributed messaging systems such as Apache Kafka are so popular for such designs.
  4. The design should also ensure that the system maintains message “offset” information. This information can be used to support “rewind” or “replay” functionality in the system.
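
Considerations 3 and 4 can be sketched with a small in-memory log in Python. This is a toy model of the idea, not the Apache Kafka API: the ReplayableLog class and its method names are hypothetical, it runs in a single process, and it never expires old messages, whereas a real messaging system would retain them only for a configured period.

```python
from typing import Dict, List


class ReplayableLog:
    """Toy append-only log with per-consumer offsets (illustrative only)."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._offsets: Dict[str, int] = {}  # consumer name -> next offset to read

    def append(self, message: str) -> int:
        """Collect a message and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer: str, max_messages: int = 10) -> List[str]:
        """Return unread messages for a consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def rewind(self, consumer: str, offset: int) -> None:
        """Move a consumer back (or forward) so messages can be replayed."""
        if not 0 <= offset <= len(self._messages):
            raise ValueError("offset out of range")
        self._offsets[consumer] = offset


log = ReplayableLog()
for reading in ["a", "b", "c"]:
    log.append(reading)          # endpoint "A" writes without waiting for "B"

first_pass = log.poll("endpoint-B")
log.rewind("endpoint-B", 1)      # replay everything from offset 1 onward
replayed = log.poll("endpoint-B")
```

Because the producer only appends and each consumer tracks its own position, collection stays decoupled from consumption, and a consumer that fails mid-processing can rewind to its last good offset instead of losing data.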

B. Design Pattern 2 – Processing/decision making with “external lookups”

This design pattern applies mostly to systems that are required to do more than just exchange data between two endpoints. Such systems can be used to identify potential fraud, generate alerts based on business rules or take pre-determined actions based on a set of variables.

Such systems require some kind of reference data that is available only outside the context (the actual data pipeline) they operate in, which is why such a lookup is called an “external lookup”.

Here are a few things to consider while designing such systems:

  • These kinds of systems are mostly NRT because of the additional processing required. While designing the system, it is important to understand the “latency” threshold and the consequences of not meeting it.
  • Depending on the nature of the data and processing (or computing), the designer should explore creating “micro-batches” of data to make processing efficient.
  • Close attention should be paid to the location of external lookups. It is important to understand whether a copy of the reference data can be kept within the system architecture to reduce network travel time. It is also important to consider whether some of the reference data can be cached in memory to speed up the lookup.
  • The nature of the lookup must be examined carefully to identify the best architecture and technology. For example, if a lookup can be done on a single key, storing the lookup information in a solution like Apache HBase can provide “sub-second” lookup performance, whereas a lookup that requires scanning a larger dataset may need other solutions, such as caching in memory.
  • If the system requires additional computations, aggregations and support for “ad hoc” analysis, those factors should be reviewed carefully. The nature of the ad hoc analysis, the level of data persistence and the access patterns should all be examined before selecting a storage technology.
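
The caching point above can be illustrated with a short Python sketch of a single-key lookup that keeps results in memory. Everything here is an assumption for illustration: the CachedLookup class, the enrich helper, the "account" field and the dictionary standing in for the remote reference store. The cache also has no eviction or expiry, so a real design would need to decide how stale reference data is allowed to become.

```python
from typing import Any, Callable, Dict, Optional


class CachedLookup:
    """Single-key external lookup with an in-memory cache (illustrative)."""

    def __init__(self, fetch: Callable[[str], Optional[dict]]) -> None:
        self._fetch = fetch                      # the slow, "external" call
        self._cache: Dict[str, Optional[dict]] = {}
        self.remote_calls = 0                    # counts cache misses

    def get(self, key: str) -> Optional[dict]:
        if key not in self._cache:
            self.remote_calls += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]


def enrich(event: Dict[str, Any], lookup: CachedLookup) -> Dict[str, Any]:
    """Attach reference data to an event to establish the decision context."""
    return {**event, "reference": lookup.get(event["account"])}


# A plain dict stands in for the external reference store.
reference_store = {"acct-1": {"risk": "high"}}
lookup = CachedLookup(reference_store.get)

first = enrich({"account": "acct-1", "amount": 50}, lookup)
second = enrich({"account": "acct-1", "amount": 70}, lookup)
# Only the first event triggers an external call; the second is served
# from memory.
```

Counting the misses, as remote_calls does here, is a simple way to check whether the cache is actually reducing network travel for a given stream.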


This design pattern can be treated as a framework and can be very powerful. It can also support several other functions, such as executing algorithms on streaming data, computing aggregations over windows, and sending commands to other systems instead of data or alerts.
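
As one example of window-based aggregation, the Python sketch below sums values over fixed, non-overlapping ("tumbling") time windows. It is a simplification under my own assumptions: events are (timestamp in seconds, value) pairs, the function name is hypothetical, and all events are available at once, whereas a streaming engine would emit each window's result as the window closes and would have to handle late-arriving data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_sums(events: Iterable[Tuple[float, float]],
                         window_seconds: float) -> Dict[int, float]:
    """Sum event values per fixed-size window.

    Window i covers timestamps in [i * window_seconds, (i + 1) * window_seconds).
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    sums: Dict[int, float] = defaultdict(float)
    for timestamp, value in events:
        window_index = int(timestamp // window_seconds)
        sums[window_index] += value
    return dict(sums)


events = [(0.5, 1.0), (9.9, 2.0), (10.0, 4.0), (25.0, 8.0)]
print(tumbling_window_sums(events, window_seconds=10))
# {0: 3.0, 1: 4.0, 2: 8.0}
```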

Conclusion

This article is my modest effort to discuss some basic architectural considerations and design patterns. As you can see, the possibilities are limitless, and the available technologies provide the robust platform required to unleash the power of creativity.

In my view, in the years to come, real-time and near real-time systems will become more common within the enterprise. Established rules, such as processing most data in batch mode, will be challenged as technology platforms become more powerful and sustainable.

Cloud computing and IoT-based systems have become catalysts in realizing this vision.

It is only a matter of time before more and more corporations start realizing the profound value that real-time, streaming and IoT systems bring to their business models.

Article written by Manoj Vig
Image credit by Getty Images, Corbis, Andrew Brookes