The Key to Ultimate Productivity in Data Science Teams – Agile Data Analysis

When Agile meets big data and machine learning

Machine learning is on an irreversible path to reshape how we do our jobs and to automate procedures across many sectors, eventually touching the whole economy.

In fact, advanced analytics is the exciting technology that goes two or three steps beyond dashboards: it uses data to drive computations that inform more complex decisions – predictive analytics, decision modeling and, ultimately, services provided by computers. Efficiency and quality improvements depend on data management, data analysis and data governance.

Can Agile work for big data projects?

In big data, multiple experts and the IT team must build the solution together in timely, well-defined iterations, which demands much more engagement and availability from everyone on the project. Agile is all about incremental delivery.

So what does this mean in practice? In my view, each delivery should include not only results (the analysis) and insights (what those results imply) but also the code or function that generated them, so that the results are replicable. Agile principles grew precisely to deal with uncertainty about user requirements and a business environment that imposes frequent changes on the project.

For a BI project, on the other hand, an organization may need one person for Extract, Transform and Load (ETL) and one for the semantic layer and dashboard visualization. For big data, and even more so for machine learning, several other roles are involved, such as an experienced software developer (Java, Python, R) and at least someone who knows the Hadoop ecosystem, besides a list of APIs that may need to be developed from scratch. Platforms like IBM Watson, AWS Machine Learning, Microsoft Cortana and Google TensorFlow are maturing but still require technical knowledge to use. Each team member needs to collaborate within the team and with users; this is where the agile methodology comes into play.

For many data analysts, agile doesn't seem to apply to what they do. They consider that, unlike software development, analytic workflows are intrinsically uncertain, making it difficult to plan sprints. And analysts tend to work individually, so they see little value in all those agile collaboration ceremonies – stand-ups, sprint planning, retrospectives, etc. Analysts find it difficult to see how any of this helps them deliver analytical projects. I think this is because agile coaches focus on the rituals, ignore the spirit of Agile, and fail to translate it properly for analytical work.

As mentioned earlier, an actionable machine learning project is inherently iterative: the final outcome is uncertain, and requirements and algorithms must be re-examined after each iteration. Each iteration has a deliverable – a model to be shared and reviewed with the main stakeholders. Therefore, most agile principles, and the methodology suitably adjusted, fit well.

There are lots of possible analytical routes and methodologies compatible with a narrative. In Agile, you don't pursue any single one to the end before reporting back. Scope out your options and break them down roughly. Each step (or "user story") should express what you are going to achieve and for whom. The first few steps should be more concrete than the later ones.

ISyE at Georgia Tech – Agile machine learning techniques

The H. Milton Stewart School of Industrial & Systems Engineering (ISyE) at Georgia Tech was working on multiple data fusion for abnormality detection in the semiconductor manufacturing process, studying the massive amounts of data continuously streamed from hundreds of sensors embedded into the manufacturing equipment.

The challenges were how to extract useful information from the data, learn the system's behavior and improve its performance. The analytical issues were complicated by the sensing data's high dimensionality, variety and velocity, and by its intricate spatial and temporal structures.

ISyE faculty addressed these challenges by developing scalable and agile machine learning techniques that provide effective modeling and analysis of multi-sensor data streams, allowing researchers to extract essential information for manufacturing improvement.
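To make the idea of analyzing a streaming sensor feed concrete, here is a minimal sketch – purely illustrative, not ISyE's actual method – of flagging abnormal readings against a rolling baseline (the window size and z-score threshold are assumptions):

```python
from collections import deque
from statistics import mean, stdev

class StreamAnomalyDetector:
    """Flag readings that deviate strongly from a rolling-window baseline."""

    def __init__(self, window=30, threshold=3.0):
        self.window = deque(maxlen=window)  # most recent readings
        self.threshold = threshold          # z-score cutoff for "abnormal"

    def update(self, value):
        """Return True if `value` is anomalous relative to the current window."""
        anomalous = False
        if len(self.window) >= 5:  # need a minimal baseline first
            mu, sigma = mean(self.window), stdev(self.window)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                anomalous = True
        self.window.append(value)
        return anomalous

# A steady signal followed by a spike the detector should flag
detector = StreamAnomalyDetector(window=20, threshold=3.0)
readings = [10.0, 10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.1, 25.0]
flags = [detector.update(r) for r in readings]
```

In practice one such detector would run per sensor channel, and the flagged readings would feed the downstream diagnosis and control steps described below.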

More on overcoming challenges

In addition to real-time monitoring and fault diagnosis and control, machine learning facilitates online product inspection and can predict potential failures in the manufacturing process, thereby allowing time for planning corrective and preventive actions.

Nevertheless, there are some challenges to be overcome in each organization:

  • Data scientists and business experts must work as a team with SMEs (Subject Matter Experts) and with software engineers;
  • Rapid delivery of actionable predictions discovered from big data;
  • How should agile processes be adapted for big data analytics development? How should agile principles be combined with the architecture design method to achieve effective agile big data analytics?

What should I do if machine learning is only part of a bigger project?

A Project Management Office, or just a project manager, should be able to manage with a hybrid approach, accommodating different execution methodologies while keeping the same high-level management approach so that reporting stays consistent.

But in some cases, should the machine learning tasks be organized into user stories? And why would you do this?

Here are some situations that can occur within a firm:

  • Project management is organized entirely around user stories;
  • All scheduling is based on fitting entire user stories into sprints, rather than scheduling tasks individually;
  • Other teams have made great use of agile methodologies and have benefited from modeling all software features as user stories.

Some user stories follow naturally: as a customer, I want to search for items by category, so that I can easily find the most relevant items within a huge, complex database. Or: as a content editor, I want categorical designations created automatically for the items in the database, so that customers can easily find high-value data.

The tricky part is figuring out how to create subordinate user stories for the rest of the machine learning architecture. Basically, the algorithm requires two major architectural subdivisions: (A) training and (B) classification. And the training portion of the architecture requires construction of a cluster-space. But a cluster-space, in and of itself, provides zero business value. Nor does a crawler, or a feature-extractor.

There's no business value (not for the end-user, or for any of the roles internal to the company) in a partial system. A trained cluster-space is only possible with the crawler and feature extractor and only relevant if we also develop an accompanying classifier.
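The (A) training and (B) classification split can be sketched with a toy example – a naive k-means in plain Python, purely illustrative and not the article's actual system. It assumes the crawler and feature extractor have already produced numeric feature vectors:

```python
import math

def train_cluster_space(points, k=2, iters=10):
    """Training step (A): build a cluster-space (k centroids) via naive k-means."""
    centroids = list(points[:k])  # naive init: first k points as seeds
    for _ in range(iters):
        # Assign each point to its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[idx].append(p)
        # Recompute each centroid as the mean of its group
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def classify(point, centroids):
    """Classification step (B): assign a point to its nearest centroid."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

# Two well-separated blobs: one around (0, 0), one around (10, 10)
data = [(0.1, 0.2), (0.0, -0.1), (0.2, 0.1),
        (10.1, 9.9), (9.8, 10.2), (10.0, 10.1)]
space = train_cluster_space(data, k=2)
near_zero = classify((0.0, 0.0), space)
near_ten = classify((10.0, 10.0), space)
```

Notice that `train_cluster_space` alone delivers nothing a stakeholder can use – only together with `classify` does the pair produce a visible result, which is exactly the difficulty of slicing this architecture into independent user stories.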

Any story has a role, an action and a goal. So one suggestion is to write each story as a named role doing something to achieve a goal. You can run into trouble here, getting caught up in "business value."

Start by defining, up front, how you'll know when you have completed your task successfully. Then "achieves business value" simply means making some progress toward that goal. This is what I mean by translating the agile approach to a big data/machine learning project.
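One hypothetical way to make that definition of "done" concrete is to encode it as an executable acceptance check against an agreed baseline (the metric, threshold and sample values here are all assumptions for illustration):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)

def story_done(predictions, labels, baseline=0.70):
    """Acceptance criterion for the story: the model must beat the baseline."""
    return accuracy(predictions, labels) >= baseline

# Toy holdout set: 8 of 10 predictions are correct (0.8 accuracy)
preds = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
labels = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
done = story_done(preds, labels)
```

A sprint that moves the metric toward the threshold has "achieved business value" in exactly the sense described above, even if the story is not yet done.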

From the strategic decision to a bottom-up determination to change

In many organizations, leaders have just decided that their company must be an agile and intelligent organization (which means decisions are made based on big data and advanced analytics).

According to C.K. Prahalad, each company has a known business model supported by business processes that reinforce this model – which is one main reason that an intellectual understanding of the need for change, and a desire for change, are not enough. The company needs the administrative capacity to execute that change. In most companies, there is a gap between the capacity to think and the capacity to act.

Nevertheless, a more practical and agile approach to adopting machine learning is quietly taking hold for the years ahead. Following the Agile Manifesto principles, teams of doers unafraid to get their hands dirty with unruly yet promising corporate data will bypass the "big data noise" entirely and carefully pick low-hanging predictive problems that they can solve with well-proven algorithms in the cloud, on smaller sampled datasets with a favorable signal-to-noise ratio.

As the team builds confidence in its abilities, the desire to deploy what it has built in production, and to add more use cases, will mount. No longer bound by data access issues and complex, hard-to-deploy tools, these practitioners not only start improving their core operations but also start considering predictive use cases with higher risk-reward profiles that can enable brand-new revenue streams.

The following comparison shows how some paradigms will change at the core of the traditional organization's mindset.


Agile analytics teams evolve toward the best system design by continuously seeking and adapting to feedback from the business community. Agile analytics balances the right amount of structure and formality against a sufficient amount of flexibility, with a constant focus on building the right solution.

The key to agility lies more in the core values and guiding principles than in a set of specific techniques and practices – although effective techniques and practices are important.

Article written by Werther Krause
Image credit by Getty Images, DigitalVision Vectors, gobyg