The Art and Science of Capturing Intelligence

A log is a record of events generated by a system. That system could be an operating system, a web application, a network traffic monitor, and so on. Auditing of user actions, a form of human-machine interaction, is also an important source of intelligence.

Log files rarely hold data in a structured or organized format, so they are not ready for direct analysis by humans.
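As a minimal sketch of what turning a raw line into something analyzable can look like, the snippet below parses one log line into named fields. The line layout (date, time, level, component, message) and the names `LINE_PATTERN` and `parse_line` are assumptions made for illustration; real log formats vary widely and would need their own patterns.

```python
import re
from typing import Dict, Optional

# Hypothetical layout: "<date> <time> <LEVEL> <component>: <message>"
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<component>[\w.-]+): "
    r"(?P<message>.*)"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of a log line, or None if it does not match."""
    match = LINE_PATTERN.match(line.strip())
    return match.groupdict() if match else None


record = parse_line("2024-01-15 10:32:07 ERROR auth-service: login failed for user 42")
print(record)
```

Lines that do not fit the pattern come back as `None` rather than being dropped silently, so they can be set aside and inspected later.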

Log files contain a wealth of information, such as network intrusion details, unauthorized access to valuable assets (employee salaries, equities, etc.), firewall-related access, and real-time user behavior.

Traditional means of collecting this information and converting it into a readable, analyzable format require days or weeks. By the time the information comes out, the damage is already done! The rise in internet usage, smartphone usage, the Internet of Things, and distributed deployments has hugely increased the volume and variety of log files.

Log files are not just a means of debugging and troubleshooting; they are an active, real-time source of data for fraud analytics, network performance tracking, optimization, outage prevention, detection of memory hot spots, etc.

Tracking log files in real time helps predict a P1 (a highest-priority incident) ahead of time. This shift left gives more room for resolution and, most of the time, helps prevent the P1, reduce downtime, improve response readiness, and speed up recovery. Issue resolution becomes more proactive than reactive.

A system needs to be smart if it has to make decisions on the fly based on data as it is generated. It needs to orchestrate responses based on the events occurring in the system and the rules defined.
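One simple way to picture rule-driven responses is a list of rules, each pairing a condition on an event with the name of an action to trigger. This is only an illustrative sketch: the event fields (`event`, `attempts`, `percent`), the thresholds, and the action names are all hypothetical, and a production system would add state, time windows, and actual handlers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    name: str
    condition: Callable[[Dict], bool]  # returns True when the rule fires
    action: str                        # hypothetical action name to dispatch


# Hypothetical rules and thresholds, chosen only for illustration.
RULES = [
    Rule(
        "repeated-login-failure",
        lambda e: e.get("event") == "login_failed" and e.get("attempts", 0) >= 5,
        "lock_account",
    ),
    Rule(
        "disk-nearly-full",
        lambda e: e.get("event") == "disk_usage" and e.get("percent", 0) >= 90,
        "page_on_call",
    ),
]


def evaluate(event: Dict, rules: List[Rule] = RULES) -> List[str]:
    """Return the actions of every rule whose condition matches the event."""
    return [rule.action for rule in rules if rule.condition(event)]


print(evaluate({"event": "login_failed", "attempts": 6}))
```

Keeping rules as data rather than hard-coded branches means they can be tuned or extended without changing the evaluation loop.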

We must also understand that, since a log is based on actual events happening in a system, a level of uncertainty in the data cannot be ignored. There is a constant need to keep tuning the machine learning algorithms and refining the training data sets. It is quite possible that an event marked as noise carries a hint that solves the most troubling use case. To identify hidden patterns, a bottom-up approach is required.

Enterprises spend a huge amount of money to manage IT infrastructure and multiple applications. A very common question in our daily routine is ‘Which application caused the platform outage?’ It comes up often when a single platform hosts multiple applications. With distributed systems in place, another challenge is manually translating time stamps (between GMT, EST, CST, etc.) to get a true window of activity for root cause analysis and to plan action for resolution.
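The time stamp translation described above can be automated by normalizing every stamp to UTC before comparing events. The sketch below assumes fixed offsets (GMT = UTC, EST = UTC-5, CST = UTC-6, reading CST as US Central Standard Time) and a `YYYY-MM-DD HH:MM:SS` stamp format; it deliberately ignores daylight saving time, which a real deployment would handle with a full time zone database. The names `FIXED_OFFSETS` and `to_utc` are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumed fixed offsets; daylight saving time is NOT handled here.
FIXED_OFFSETS = {
    "GMT": timezone.utc,
    "EST": timezone(timedelta(hours=-5)),
    "CST": timezone(timedelta(hours=-6)),
}


def to_utc(stamp: str, zone: str) -> datetime:
    """Interpret a 'YYYY-MM-DD HH:MM:SS' stamp in the given zone and convert it to UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=FIXED_OFFSETS[zone]
    )
    return local.astimezone(timezone.utc)


# Two servers reporting what is actually the same instant:
print(to_utc("2024-01-15 09:00:00", "EST"))
print(to_utc("2024-01-15 08:00:00", "CST"))
```

Once all events share one time base, sorting them gives a single timeline for the window of activity, regardless of where each server is located.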

With more and more enterprises moving towards cloud infrastructure, we should be able to decide, based on network traffic and processing capacity utilized, whether to merge clusters or increase the capacity of a given cluster.

Based on trends and predictions, infrastructure risk profiling can be prepared. Performance tests can be done to identify bottlenecks. With proven results at hand, it is easier to engage stakeholders early in the cycle.

To conclude, by monitoring logs we can be proactive, automate diagnostics and remediation, and be ready for the next scale-up that is required.

Operational efficiency and resilience are the low-hanging fruit, and the major benefits an enterprise can achieve.

Article written by Akshey Gupta