Jun 30, 2015

Modern enterprise data science teams are technically diverse. The members of these teams differ in their education levels (bachelor's, master's, PhD), majors (CS, statistics, natural sciences), prior work experience (advertising, actuarial work, finance, experimental physics) and the tools they use (Excel, SAS, R, Python, Java). One is more likely to find biology or computational neuroscience majors in data science teams today than in actuarial or financial firms of the past.

But there is a more subtle difference than the ones listed above: team members with quantitative backgrounds may differ in exposure to the *same* set of concepts. Take the example of regression. Both engineering and statistics departments devote a portion of their curriculum to teaching line fitting. The presentations in these disciplines, however, have historically differed.

Their terminology is also different: statisticians call it regression, engineers call it curve-fitting. With an increasing number of non-CS engineers and scientists joining data science teams, it is instructive to examine the differences between statistical and engineering approaches to common data science concepts.

Understanding these differences has several benefits. First, it enables effective communication between team members with different backgrounds. Second, it leads to more efficient training. Suppose, for example, that the statistical features of curve-fitting are important for a particular business. The business can call those aspects out explicitly and use them to craft focused training sessions for team members who are familiar with the curve-fitting procedure but not necessarily with its statistical fundamentals. Lastly, once the business-critical technical skills are identified, they can be used to make job descriptions more accurate (than the generic ones we often encounter) or be emphasized during hiring decisions.

In the following, the example of regression is used for concreteness. There are other data science concepts that overlap multiple disciplines, but are referred to differently. Readers who've experienced this will be able to extrapolate the regression example to these other areas.

**For engineers and physical scientists**, line fitting is a tool to understand the physical law driving the observations. Kepler analyzed tables of planet position data to discover laws of planetary motion. He was interested less in predicting future positions of planets than in the laws that governed them.

**To engineers**, the values of the fitting parameters (e.g. slope and intercept) have to make sense. These values represent measurable physical quantities. For example, weight (mass) and volume have a linear relationship for a given material. If such a relationship were plotted, the slope would represent the “density.” For an automobile, the relationship of miles travelled at a constant speed vs gas consumed can be expected to be linear. The slope here represents the “mileage” of the automobile at that speed.
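As a minimal sketch of this habit of reading physical meaning into the slope, the snippet below fits a line to mass-versus-volume measurements and interprets the slope as a density. The data are synthetic and purely illustrative; the assumed density of 2.7 g/cm^3 and the noise level are made-up values, not taken from the article.

```python
import numpy as np

# Hypothetical measurements for illustration only (not from the article):
# volumes in cm^3 and masses in g for samples of a single material.
rng = np.random.default_rng(0)
true_density = 2.7  # g/cm^3, assumed value used to generate the data
volume = np.linspace(1.0, 10.0, 20)
mass = true_density * volume + rng.normal(0.0, 0.2, size=volume.size)

# Fit mass = slope * volume + intercept.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(volume, mass, deg=1)

print(f"estimated density (slope): {slope:.2f} g/cm^3")
print(f"intercept: {intercept:.2f} g (a physical fit should put this near zero)")
```

An engineer would check both numbers: a slope far from the known density, or an intercept far from zero (mass at zero volume), would cast doubt on the measurements even if the line fit the points well.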

As such, engineers and natural scientists tend to inspect the values of the obtained fitting parameters closely. **In the sciences**, if a fit predicts physically unreasonable values of parameters, the model is discarded (and the underlying experiments repeated) regardless of the fit quality.

The predictive ability of the fit is also a secondary concern. The objective of modeling in the sciences is usually to propose new experiments not yet carried out and to predict their results, as opposed to predicting the results of future measurements from the same apparatus.

The assumptions underlying the prescribed fitting procedure are rarely mentioned explicitly. The choice of “sum of squares” as a cost function is justified on the grounds that it possesses “nice” mathematical properties, such as differentiability and convexity, that make the minimum of the cost function easy to locate. It is not uncommon for data scientists without a statistics background to be unable to list the assumptions behind the sum of squares cost function.

**For statisticians**, the focus is on getting the best quality fit and using it to predict the expected values of future observations. The predictive quality of the model is explicitly captured by the dependence of the “model quality” on out-of-sample error.

There is comparatively less emphasis in statistics on understanding the physical phenomena that underlie the observations. The efforts are mostly devoted to constructing accurate, predictive mathematical models. This is possibly because datasets under consideration often do not permit regeneration under controlled circumstances.

Statisticians also expend considerable energy on reducing the out-of-sample errors. This includes techniques such as adding complexity to the fitting function (feature interactions, kernels, nonlinearities), fine-tuning the cost function (regularization), reducing dimensionality, and, whenever possible, gathering more data.
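The sketch below illustrates two of these techniques together, under stated assumptions: polynomial features stand in for "added complexity" and a ridge penalty stands in for "regularization". The data, the polynomial degree, the penalty values and the helper names (`design`, `fit_ridge`, `mse`) are all hypothetical choices made for illustration. For simplicity the penalty is applied to the intercept as well, which a production implementation would usually avoid.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data for illustration: a noisy sine curve on [-1, 1].
x = rng.uniform(-1.0, 1.0, size=60)
y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

# Hold out the last 20 points to estimate the out-of-sample error.
x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

def design(x, degree=7):
    """Polynomial features 1, x, ..., x^degree (the added complexity)."""
    return np.vander(x, degree + 1, increasing=True)

def fit_ridge(X, y, lam):
    """Minimize ||y - X b||^2 + lam * ||b||^2; lam = 0 is plain least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def mse(X, y, b):
    """Mean squared error of the fit b on the data (X, y)."""
    return float(np.mean((y - X @ b) ** 2))

X_train, X_test = design(x_train), design(x_test)
for lam in (0.0, 0.1, 1.0):
    b = fit_ridge(X_train, y_train, lam)
    print(f"lambda={lam:4.1f}  "
          f"in-sample MSE={mse(X_train, y_train, b):.3f}  "
          f"out-of-sample MSE={mse(X_test, y_test, b):.3f}")
```

The in-sample error can only grow as the penalty increases; whether the out-of-sample error improves depends on the data, which is why it is measured on held-out points rather than assumed.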

In other words, interpretable models are nice but not strictly necessary for the overall success of the effort.

Of the two viewpoints considered above, neither is more “correct”, “valid” or “scientific.” Both have proven successful in their respective domains. Scientific laws are re-examined despite their excellent “fit” and predictive abilities. Conversely, autonomous cars and state-of-the-art image recognition have been enabled by models whose mathematical and physical properties are less than completely understood.

Let us now contrast the linear regression math as presented in engineering and statistics. The math below is non-rigorous by design. Imagine we have a set of 100 points as shown in the scatter plot below.

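Engineering disciplines typically (but not always) adopt a linear-algebra-based approach to regression. Taking the simple example of single-variable (univariate) regression, we can express the observed values of the dependent variable ($y$) as a linear function of the independent variable ($x$) as follows:

$$y_i = \beta_0 + \beta_1 x_i, \qquad i = 1, \dots, 100 \tag{1}$$

where $\beta_0$ is the intercept and $\beta_1$ is the slope. Note that this is an over-determined set of equations since there are more equations than unknowns. To solve it in the least squares sense, we compute the sum of squared residuals, termed the cost function:

$$J(\beta_0, \beta_1) = \sum_{i=1}^{100} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \tag{2}$$

and obtain the fit coefficients via the standard minimization procedure of setting $\partial J / \partial \beta_0 = 0$ and $\partial J / \partial \beta_1 = 0$. The solution is easier to express and generalize in matrix terms. If we set:

$$X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_{100} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_{100} \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$

then the least squares problem, re-expressed in matrix-vector form, is

$$\min_{\boldsymbol{\beta}} \; \left\lVert \mathbf{y} - X \boldsymbol{\beta} \right\rVert^2 \tag{3}$$

with the solution

$$\boldsymbol{\beta} = \left(X^T X\right)^{-1} X^T \mathbf{y} \tag{4}$$

The coefficients $\beta_0$ and $\beta_1$ obtained from equation (4) are identical to those obtained by minimizing the cost function in equation (2). The matrix-based solution also generalizes to multivariate regression, i.e. to situations where we have more than one independent variable. Explicit matrix inversion is seldom carried out in practice for numerical stability reasons. Instead, the system

$$X^T X \, \boldsymbol{\beta} = X^T \mathbf{y} \tag{5}$$

is solved by Gaussian elimination (or, equivalently, LU decomposition) or by iterative methods. Engineering treatments of curve-fitting typically halt after a description of the above procedure. The justification and assumptions underlying these prescriptions are either not emphasized or deferred to statistics texts.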
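The procedure above can be sketched in a few lines of NumPy. The 100 data points here are synthetic, and the true intercept (1.5), slope (0.8) and noise level are assumed values chosen only for illustration. The snippet compares the explicit-inverse formula, a direct solve of the normal equations, and NumPy's general least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(42)

# 100 synthetic points scattered about a line (values are illustrative).
x = rng.uniform(0.0, 10.0, size=100)
y = 1.5 + 0.8 * x + rng.normal(0.0, 1.0, size=x.size)

# Design matrix: a column of ones for the intercept, then the x values.
X = np.column_stack([np.ones_like(x), x])

# Explicit inverse of X^T X, shown only for comparison.
beta_inverse = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve the normal equations X^T X beta = X^T y without forming an inverse
# (np.linalg.solve uses an LU factorization with pivoting).
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# General least-squares routine that works on X directly.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("explicit inverse:", beta_inverse)
print("normal equations:", beta_solve)
print("lstsq:           ", beta_lstsq)
```

On a small, well-conditioned problem like this one the three answers agree closely; the case for avoiding the explicit inverse becomes visible only when the columns of the design matrix are nearly collinear.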

The statistical approach to regression aims to capture the probability distribution of the points about their expected value.

The fitting function specifies the expected position of the dependent variable for a given input. Linear regression is the hypothesis that the expected position depends linearly on the input:

$$Y_p = \beta_0 + \beta_1 X_i$$

We then compute the *error* between the predicted/expected ($Y_p$) and the observed ($Y_i$) values of the dependent variable:

$$\epsilon_i = Y_i - Y_p = Y_i - \beta_0 - \beta_1 X_i$$

We then make a few critical assumptions about our observed data:

- The errors are distributed normally about the expected value: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$
- The observations are independent and drawn from the same distribution
- The variance of the error ($\sigma^2$) is independent of the input ($X_i$)

After this setup, we seek the values of the fitting parameters $\beta_0$ (intercept) and $\beta_1$ (slope) that maximize the probability of obtaining the dataset under consideration. This is known as the maximum likelihood estimate (MLE) and is a widely used parameter estimation technique.

Since we've assumed observations to be independent of each other, the overall probability of obtaining the observed set of errors is just the product of the probabilities of obtaining the individual errors:

$$P(\epsilon_1, \dots, \epsilon_{100}) = \prod_{i=1}^{100} P(\epsilon_i)$$

which, on substituting the normal distribution of error probabilities, becomes

$$P = \prod_{i=1}^{100} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\left(Y_i - \beta_0 - \beta_1 X_i\right)^2}{2\sigma^2}\right) = \left(2\pi\sigma^2\right)^{-50} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{100} \left(Y_i - \beta_0 - \beta_1 X_i\right)^2\right) \tag{6}$$

Maximizing the likelihood $P$, or equivalently $\log P$, requires minimizing the sum of squared errors in the exponent and leads to the least squares cost function. The generalization to the multivariate case yields the equation system (5).
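As a numerical sanity check of this equivalence, the sketch below evaluates the negative log-likelihood of a normal error model on a grid of candidate intercepts and slopes centred on the least squares fit, and confirms that the most likely grid point is the least squares one. The data are synthetic, the error standard deviation is assumed known and equal to 1, and the function name `neg_log_likelihood` and the grid ranges are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic data that satisfies the stated assumptions (illustrative values).
X = rng.uniform(0.0, 10.0, size=100)
sigma = 1.0
Y = 2.0 + 0.5 * X + rng.normal(0.0, sigma, size=X.size)

def neg_log_likelihood(b0, b1):
    """Negative log of the product of normal densities of Y_i - b0 - b1 * X_i."""
    err = Y - b0 - b1 * X
    n = err.size
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(err**2) / (2 * sigma**2)

# Least squares estimate (np.polyfit returns the slope first for deg=1).
b1_ls, b0_ls = np.polyfit(X, Y, deg=1)

# Brute-force evaluation of the likelihood on a grid centred on that estimate.
b0_grid = b0_ls + np.linspace(-1.0, 1.0, 101)
b1_grid = b1_ls + np.linspace(-0.2, 0.2, 101)
nll = np.array([[neg_log_likelihood(b0, b1) for b1 in b1_grid] for b0 in b0_grid])
i, j = np.unravel_index(np.argmin(nll), nll.shape)

print("least squares estimate:  ", b0_ls, b1_ls)
print("most likely point on grid:", b0_grid[i], b1_grid[j])
```

The grid search is deliberately naive; it is there to show that the two criteria pick the same parameters, not to suggest a practical way of computing the MLE.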

The statistical approach is not free from assumptions, and many of these may seem ad hoc: identical normal distribution of errors, independence of observations and the use of the MLE for parameter estimation. However, the assumptions are explicit and testable, and they provide a deeper justification for arriving at the least squares cost function.

Taking the example of regression, we touched on the issue of technical diversity in data science teams.

Specifically, we considered how quantitative team members could be exposed to common concepts in different ways. The above discussion is, of course, not a call for generalizations such as “statisticians can't code” or “engineers don't know probability theory.”

**Rather, it is meant to highlight an issue that is expected to become prominent as people from engineering, natural sciences and social sciences join computer scientists and statisticians in pursuit of data science.**

Non-technical or business-facing people could easily wave away these differences as academic or easily handled “on the job.” In many cases, however, a “statistical” or “physical” way of thinking is ingrained and impacts how job assignments are carried out (e.g. does the person use hill-climbing or write a multi-threaded program to perform an exhaustive search of the parameter space).

Being mindful of the approach and vocabulary of the others' fields is helpful in guiding teams toward more fruitful collaborations as well as correctly estimating quantitative skill levels during hiring decisions.

Article written by Rohan Kekatpure
