Working with data can be hard. You might spend hours on your model or analysis without getting any reasonable results. At that point, it might be tempting to blame your performance issues on the wrong choice of method. After all, with so many algorithms out there, there must be one candidate that will solve your problem, right?

More often than not though, the underlying issue is the data itself. In fact, you can often get quite far with very simple models as long as you have a good dataset to work with. Thus, in this article, we will explore four ways to improve the latter.

First, we will look at the rather unsurprising approach of ‘just’ increasing the size of the available data. While this is indeed an obvious solution, there are some interesting considerations that we will explore. Second, we will consider ways to improve the quality of a dataset – i.e. how to build ‘better’ datasets in a narrower sense.

## Wider data – the blessing of dimensionality

Is there a way to improve a dataset so much that a simple if-else rule would outperform a sophisticated Deep Learning model? The answer is ‘yes’. Consider the following, single-dimensional, binary classification problem:

Ask yourself if the best model at your disposal could perform reasonably well here. Unfortunately, the conditional class distribution appears to be completely random. Even with state-of-the-art models and high-end hardware you would not be able to build a reasonably predictive solution.

What if I told you that I created the dataset without using any random values? Here is the resolution:

Leaving out a crucial second variable turned a simple predictive problem into a hard one. Now, what are the practical consequences of this trivial example?

While your current features might appear sound, you could still lack other, less obvious but, nonetheless, crucial ones. For example, forecasting product sales might be tough without seasonality and weekend features. Also, as this toy example shows, **two or more features might even only be predictive in interaction with each other**.
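The interaction effect can be sketched in a few lines of NumPy. The data-generating rule below is an assumed XOR-style toy example: with both features available, a trivial if-else rule is a perfect classifier, yet projected onto either feature alone the classes look completely random.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two uniform features; the class label is an XOR-style interaction:
# the label is 1 exactly when the two features fall on opposite sides of zero.
n = 1000
x1 = rng.uniform(-1.0, 1.0, size=n)
x2 = rng.uniform(-1.0, 1.0, size=n)
y = ((x1 > 0) != (x2 > 0)).astype(int)

# With both features, a trivial if-else rule is a perfect classifier ...
rule = ((x1 > 0) != (x2 > 0)).astype(int)
accuracy_2d = (rule == y).mean()

# ... but given x1 alone, the classes are indistinguishable:
# P(y = 1 | x1) = 0.5 everywhere, so no one-dimensional model can beat chance.
p_y1_left = y[x1 <= 0].mean()
p_y1_right = y[x1 > 0].mean()

print(accuracy_2d)                                # 1.0
print(round(p_y1_left, 2), round(p_y1_right, 2))  # both close to 0.5
```

Dropping `x2` from this dataset is exactly the ‘crucial missing column’ scenario above.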

Conversely, is it always a good idea to add more and more features? Of course not. As you probably know, a big part of predictive modelling is variable selection. Including ever more candidate features just for the sake of it will certainly make that step much more tedious.

If the extra feature looks promising though, it could make all the difference. Always question whether the data you are being handed are sufficient to solve your problem.

#### Why are more dimensions better? Some theory

As a simplistic example, consider three variables, `X,Y,Z`, with `Z` being the target variable. Also, let all three variables follow a multivariate Gaussian distribution. We have for the mean vector and covariance matrix:

\mu=\begin{pmatrix} \mu_X \\ \mu_Y \\ \mu_Z \end{pmatrix} \quad\Sigma=\begin{bmatrix}\sigma^2_X & \sigma_{XY} & \sigma_{XZ} \\ \cdot & \sigma_Y^2 & \sigma_{YZ} \\ \cdot & \cdot & \sigma_Z^2\end{bmatrix}

Applying the law for conditional Gaussian variance twice, we get:

\sigma^2_{Z|Y}=\sigma_Z^2-\frac{\sigma_{ZY}^2}{\sigma^2_Y}\geq\sigma_{Z|Y,X}^2=\sigma_{Z}^2-\frac{\sigma_{ZY}^2}{\sigma^2_Y}-\frac{\left(\sigma_{XZ}-\frac{\sigma_{XY}\sigma_{YZ}}{\sigma_Y^2}\right)^2}{\sigma_X^2-\frac{\sigma_{XY}^2}{\sigma_Y^2}}

This implies that using more explanatory variables reduces predictive uncertainty under two conditions:

- **Relevancy**: All explanatory variables are correlated with the target.
- **Non-redundancy**: The explanatory variables are not highly correlated with each other.
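To make the variance reduction tangible, here is a small numerical check of the conditional-variance chain above, computed via the standard Schur-complement formula for conditional Gaussians (the covariance values are illustrative assumptions):

```python
import numpy as np

# Joint covariance of (X, Y, Z); the concrete values are assumptions for the demo.
Sigma = np.array([
    [1.0, 0.3, 0.5],
    [0.3, 1.0, 0.4],
    [0.5, 0.4, 1.0],
])

def conditional_var(Sigma, target, given):
    """Variance of `target` given the variables in `given` (Schur complement)."""
    s_tt = Sigma[target, target]
    s_tg = Sigma[np.ix_([target], given)]
    s_gg = Sigma[np.ix_(given, given)]
    return (s_tt - s_tg @ np.linalg.solve(s_gg, s_tg.T)).item()

var_z = Sigma[2, 2]                                        # unconditional
var_z_given_y = conditional_var(Sigma, target=2, given=[1])
var_z_given_xy = conditional_var(Sigma, target=2, given=[0, 1])

# Each added (relevant, non-redundant) variable shrinks predictive uncertainty:
print(var_z, var_z_given_y, var_z_given_xy)   # 1.0 > 0.84 > ~0.68
```

Swapping in a covariance matrix where `X` is uncorrelated with `Z` given `Y` would leave the last two variances equal, matching the relevancy condition above.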

Also, as can be shown for Linear Regression, you need to be aware of the curse of dimensionality. A considerable increase in model complexity requires either more data-points or stronger regularization. Otherwise, you might end up with a worse model than before.

#### Where should I expect missing columns?

- **Incomplete information is everywhere**: You can almost always find information gaps in your data if you think long enough. Unfortunately, collecting more data is not always trivial and often impossible. Try to find a sweet spot between too little information and too much effort or cost.
- **Image data**: Here, the equivalent of unobserved columns is unobserved pixels. Higher-resolution images might be the answer. However, be aware of the curse of dimensionality.

**How to get more dimensions – and how to get the right ones:**

- **Work closely with domain experts or become one yourself**: Subject matter experts can often pinpoint exactly what information is necessary to model a given problem.
- **Be creative with regard to alternative data**: Wall Street can be a motivating example when it comes to the creative usage of alternative datasets. Some hedge funds, for example, are known to use parking lot satellite data to forecast quarterly sales figures of retail companies.
- **Increase granularity**: As mentioned in the image example, using data at a more granular level can add crucial information to your model. Consider BERT and most other modern NLP algorithms, which often operate on word-pieces rather than full words as their inputs.

## Longer data – if you can’t connect the dots, how could your model?

As anyone working with data will know, it is always better to have more data-points than fewer. Additional data storage is cheap in most situations. Thus, you would rather be in a position where you can exclude data from your model than not have that data in the first place.

Let us look at some theoretical considerations:

#### Why more data is better – from a mean-squared error perspective

Consider the core concept of modern Machine Learning, empirical risk minimization. We have a **loss function** between actual target and predicted target:

L(y,\hat{M}(x))

A common choice is the square loss

L(y,\hat{M}(x))=(y-\hat{M}(x))^2

Ideally, we want to choose an optimal candidate model that minimizes the expected loss (a.k.a. **risk**) over the data-generating distribution:

\begin{gather*} M_R^*=\argmin_{\hat{M}\in\mathcal{M}}R(\hat{M},p(x,y))\\=\argmin_{\hat{M}\in\mathcal{M}}\mathbb{E}_{p(x,y)}\left[L(y,\hat{M}(x))\right] \end{gather*}

As the data-generating distribution is usually unknown, we need to estimate actual risk through the **empirical risk**:

\begin{gather*} M_{\hat{R}}^*=\argmin_{\hat{M}\in\mathcal{M}}\hat{R}(\hat{M},(x_1,y_1)\times\cdots\times(x_N,y_N))\\=\argmin_{\hat{M}\in\mathcal{M}}\frac{1}{N}\sum_{i=1}^N L(y_i,\hat{M}(x_i)) \end{gather*}

With the square loss from before, we obtain the popular mean-squared error objective:

\argmin_{\hat{M}\in\mathcal{M}}\frac{1}{N}\sum_{i=1}^N (y_i-\hat{M}(x_i))^2

In the general case, the empirical risk estimator has the following statistical properties:

\begin{gather*} \mathbb{E}\left[\hat{R}\right]=\mathbb{E}_{p(x,y)}\left[L(y,\hat{M}(x))\right]\\ \mathbb{V}ar\left[\hat{R}\right]=\frac{1}{N}\mathbb{V}ar_{p(x,y)}\left[L(y,\hat{M}(x))\right] \end{gather*}

In plain English, the empirical risk estimator is:

- **Unbiased**: On average, optimizing for empirical risk is equivalent to optimizing for the true risk.
- **Consistent**: With increasing sample size, large deviations between empirical risk and true risk become less likely.
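Both properties can be verified in a quick simulation. The linear data-generating process and the fixed, deliberately imperfect candidate model below are assumptions for the demo; the point is only how the empirical risk behaves across repeated samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed truth: y = 2x + standard Gaussian noise.
# Fixed (deliberately imperfect) candidate model: M(x) = 1.5x.
def sample_losses(n):
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    return (y - 1.5 * x) ** 2          # square loss per observation

# Empirical risk = mean loss; measure its spread over many repeated samples.
def risk_spread(n, reps=2000):
    risks = np.array([sample_losses(n).mean() for _ in range(reps)])
    return risks.mean(), risks.var()

mean_small, var_small = risk_spread(n=50)
mean_large, var_large = risk_spread(n=500)

# True risk: E[(y - 1.5x)^2] = E[(0.5x + eps)^2] = 0.25 + 1 = 1.25
print(mean_small, mean_large)   # both close to 1.25 (unbiased)
print(var_small, var_large)     # variance shrinks roughly by the factor 10
```

The average empirical risk matches the true risk at both sample sizes, while its variance falls in proportion to 1/N – exactly the two formulas above.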

As a caveat, a large sample size only guarantees that you **CAN** find a model closer to the true risk optimum. If your search algorithm is bad, you might still end up worse than with fewer samples but a good search strategy. The problem of multiple local optima in Deep Learning is an example thereof.

Also, theoretically, if the mean or variance of the true risk doesn’t exist, any empirical risk based optimization will be flawed. This could happen if your data is heavy-tailed. Nassim Taleb has voiced some interesting views on this problem.

#### Where you might lose some observations for your model

- **Sensitive data**: There are situations where law or other policies don’t allow access to the full dataset. Federated learning could be a solution in this case.
- **Opt-out or opt-in policies**: If your users don’t want their data to be collected, you can’t do much besides accepting their decision. In that case, you have to live with less data and make the best of it.
- **Data loss or deletion**: Ideally, unintended data loss should never happen. Since we don’t live in a perfect world, though, always consider this worst-case scenario.

**How to get more observations or deal with too few**

- **Raise the sample rate for data collection**: If possible, try to increase the frequency of data collection – for example if you are working with sensor data. You can always switch to lower sample rates later on but never vice-versa.
- **Decrease the dimensionality of your data**: Usually, the more complex your model, the more data you need. If you have to work with fewer data-points, decreasing the dimensionality of your data could improve model accuracy.
- **Use model regularization and prior knowledge**: While regularization is commonly taught, it goes much deeper than just using an L1/L2 norm. Bayesian Machine Learning, for example, is a mathematically sound framework for regularization via prior knowledge. This can go far beyond standard regularization techniques.
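To illustrate the regularization point, here is a minimal NumPy sketch comparing ordinary least squares with closed-form ridge (L2) regression on a deliberately small sample. The data-generating model and the regularization strength are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)

# Few data-points, comparatively many features: a setting where
# unregularized least squares tends to overfit.
n, p = 20, 15
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]       # only 3 features truly matter (assumption)
y = X @ beta_true + rng.normal(size=n)

# Ordinary least squares vs. closed-form ridge regression.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
lam = 5.0                               # regularization strength (assumption)
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Ridge shrinks the coefficients towards zero - the implicit prior that
# most effects are small compensates for the scarce data.
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

The ridge penalty here encodes the prior belief that coefficients are small; a full Bayesian treatment would replace the single `lam` with an explicit prior distribution.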

## Less noisy data – give me a signal

When it comes to noise, we need to distinguish between two types:

- **Predictive noise**: A better term would be ‘randomness’. While you might observe a target variable without distortion, you cannot predict it with certainty. Including more predictive features can reduce this type of noise.
- **Perturbing noise**: Also termed **measurement error**. This is the kind of noise we want to discuss in this section. Instead of observing the variable itself, we observe some distorted version thereof. As an example, think of collecting human motion data in the midst of an earthquake.

As you can probably imagine, noisy data should be avoided, or the noise at least minimized. Below is a simple Linear Regression example of what happens to prediction quality under noise.

We start with the following – noiseless – data generating model:

\text{Target}=\alpha + \beta \cdot \text{Feature}

Instead of the raw target and feature, we observe noisy versions thereof:

\begin{gather*} \text{Target}_{\text{noise}}=\text{Target}+\eta_\text{Target}\quad\eta_\text{Target}\sim\mathcal{N}(\mu_\text{Target},\sigma^2_\text{Target})\\ \text{Feature}_{\text{noise}}=\text{Feature}+\eta_\text{Feature}\quad\eta_\text{Feature}\sim\mathcal{N}(\mu_\text{Feature},\sigma^2_\text{Feature}) \end{gather*}

Now, we visualize two scenarios:

- **Truly random (zero-mean) noise**: The errors ‘cancel out’ on average. This might happen when you take images with a camera shaken at random.
- **Systematic (non-zero-mean) noise**: Your observations are distorted on average. Stains on a camera lens could cause this for image data.

Let us compare the effects of random and systematic measurement error on the regression example:

In the non-zero-mean noise scenario, model distortion is considerably worse than in the zero-mean case. For real-world data, the consequences might be more or less severe. Either way, noise effects will definitely be harder to analyze than under lab conditions.
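The two scenarios are easy to replicate numerically. Using the noiseless data-generating model from above with assumed parameters (intercept 1, slope 2) and noise on the target only:

```python
import numpy as np

rng = np.random.default_rng(7)

# Noiseless data-generating model: target = 1 + 2 * feature (assumed parameters)
n = 10_000
feature = rng.uniform(0, 10, size=n)
target = 1.0 + 2.0 * feature

def fit_line(x, y):
    """Least-squares slope and intercept."""
    slope = np.cov(x, y, bias=True)[0, 1] / x.var()
    return slope, y.mean() - slope * x.mean()

# Scenario 1: truly random, zero-mean noise on the target.
y_random = target + rng.normal(0.0, 2.0, size=n)
slope_r, intercept_r = fit_line(feature, y_random)

# Scenario 2: systematic, non-zero-mean noise on the target.
y_system = target + rng.normal(3.0, 2.0, size=n)
slope_s, intercept_s = fit_line(feature, y_system)

print(slope_r, intercept_r)   # close to 2.0 and 1.0 - errors cancel on average
print(slope_s, intercept_s)   # close to 2.0 and 4.0 - the intercept absorbs the bias
```

With zero-mean noise the fitted line recovers the true parameters; with systematic noise the intercept is shifted by the noise mean, so every prediction is biased.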

#### An in-depth view on Gaussian data with Gaussian noise

Let us consider another simplistic, bi-variate Normal example with mean and covariance as follows:

\begin{pmatrix}X \\ Y\end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}\mu_X\\ \mu_Y\end{pmatrix},\begin{bmatrix}\sigma_X^2 & \sigma_{XY} \\ \sigma_{XY}& \sigma_Y^2\end{bmatrix}\right)

If we use linear regression, we get – for arbitrarily many data-points – the following parameters:

\begin{gather*} \beta=\frac{\sigma_{XY}}{\sigma_X^2}\\\\ \alpha=\mu_Y-\frac{\sigma_{XY}}{\sigma_X^2}\mu_X\\ =\mu_Y-\beta\mu_X \end{gather*}

Now, we pollute both variables with independent Gaussian noise:

\begin{gather*} \tilde{X}=X+\eta_X;\quad\eta_X\sim\mathcal{N}(\mu_{\eta_X},\sigma^2_{\eta_X})\\ \tilde{Y}=Y+\eta_Y;\quad\eta_Y\sim\mathcal{N}(\mu_{\eta_Y},\sigma^2_{\eta_Y}) \end{gather*}

This results in a – distorted – bi-variate Gaussian distribution:

\begin{pmatrix}\tilde{X} \\ \tilde{Y}\end{pmatrix}\sim\mathcal{N}\left(\begin{pmatrix}\mu_X +\mu_{\eta_X}\\ \mu_Y+\mu_{\eta_Y}\end{pmatrix},\begin{bmatrix}\sigma_X^2+\sigma_{\eta_X}^2 & \sigma_{XY} \\ \sigma_{XY}& \sigma_Y^2+\sigma_{\eta_Y}^2\end{bmatrix}\right)

This allows us to derive the regression parameters under noise:

\begin{gather*} \tilde{\beta}=\frac{\sigma_{XY}}{\sigma_X^2+\sigma_{\eta_X}^2}\\\\ \tilde{\alpha}=\mu_Y+\mu_{\eta_Y}-\frac{\sigma_{XY}}{\sigma_X^2+\sigma_{\eta_X}^2}(\mu_X+\mu_{\eta_X})\\ =\mu_Y+\mu_{\eta_Y}-\tilde{\beta}\mu_X-\tilde{\beta}\mu_{\eta_X} \end{gather*}
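The attenuation of the slope predicted by these formulas can be checked by simulation. The concrete covariance and noise values below are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulate a bivariate Gaussian (X, Y) with Var(X) = Var(Y) = 1, Cov(X, Y) = 0.6.
n = 200_000
cov_xy = 0.6
x = rng.normal(0.0, 1.0, size=n)
y = cov_xy * x + rng.normal(0.0, np.sqrt(1.0 - cov_xy**2), size=n)

# Noiseless regression slope: beta = cov_xy / var_x = 0.6
beta = np.cov(x, y, bias=True)[0, 1] / x.var()

# Pollute the input with independent zero-mean Gaussian noise of variance 1.
x_noisy = x + rng.normal(0.0, 1.0, size=n)

# Attenuated slope: beta_tilde = cov_xy / (var_x + var_noise) = 0.6 / 2 = 0.3
beta_tilde = np.cov(x_noisy, y, bias=True)[0, 1] / x_noisy.var()

print(beta, beta_tilde)   # close to 0.6 and 0.3
```

The noisy-input slope is roughly halved, matching the denominator term in the formula for the polluted slope.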

#### What these formulas imply

- **Zero-mean noise in the target**: If only the target is corrupted by zero-mean noise, your parameters will still be correct if the sample size is large enough. With increasing noise variance, you might need a larger sample size.
- **Zero-mean noise in the input**: In this case, the predictive power of the input feature is reduced in proportion to the amount of noise. Depending on the severity, noise reduction could thus turn a formerly useless feature into a highly predictive one.
- **Non-zero-mean noise**: Your parameter estimates, and thus your predictions, will be biased. You should avoid such systematic measurement error at all costs.

Of course, noise in the real world is generally much more complex. Noise could be varying over time or pollute your variables only in one direction. The above example should give you just a rough idea why it is important to limit the impact of measurement error.

#### Where can I expect noisy data?

- **Sensor and image data**: Polluted sensors or camera lenses can easily introduce unnecessary noise into your data. Also, noise might hint at technical incapabilities or impurities of your data collection device. You might have to get a better one if your current level of measurement error is intolerable.
- **Textual data**: Social media data can be particularly noisy – user mentions on Twitter or URLs within text easily bias NLP tools.

**How to reduce noise from two angles**

- **Ex-ante denoising**: Ideally, you should prevent measurement error before it can even enter your datasets. For computer vision problems, this might be as simple as keeping your camera lenses clean.
- **Ex-post denoising**: If you cannot prevent measurement error from happening, you need to resort to the countless methods for data denoising. A quick Google search should be a good starting point.
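As a minimal sketch of ex-post denoising, a simple moving average already goes a long way for a slowly varying signal under zero-mean noise. The signal, noise level, and window size are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(11)

# A slow signal observed through zero-mean measurement noise.
t = np.linspace(0, 4 * np.pi, 2000)
signal = np.sin(t)
observed = signal + rng.normal(0.0, 0.5, size=t.size)

# Ex-post denoising with a simple moving average (window size is an assumption).
window = 25
kernel = np.ones(window) / window
denoised = np.convolve(observed, kernel, mode="same")

# The smoothed series tracks the signal much more closely than the raw one.
mse_raw = np.mean((observed - signal) ** 2)
mse_denoised = np.mean((denoised - signal) ** 2)
print(mse_raw, mse_denoised)
```

Averaging over a window of k points cuts the noise variance roughly by a factor of k, at the price of slightly blurring fast changes in the underlying signal – which is why more elaborate denoising methods exist.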

## Better sampled data – staying true to the data generating process

Now we get to the most subtle form of potential dataset improvements. While the above issues are obvious or can at least be anticipated, this is not necessarily the case here. Incorrectly sampled data could look totally fine yet still result in erroneous models or wrong conclusions.

In an ideal world, we can draw truly random samples from the underlying data generating distribution. In reality, however, perfect random sampling is close to impossible. This will happen at the very latest when you predict future data based on past data.

In that case, the data generating distribution stretches arbitrarily far into the future. However, since you cannot collect data from the future (yet), your model will be biased towards the past. A model that predicts buyer preferences well today might struggle to deal with future shifts in consumer behaviour.

This is the infamous distributional or domain shift problem. No matter how superior your models are at the moment, you could see their performance vanish at any point in time. Luckily, it is also a well known problem and there exist many approaches to mitigate it to some extent.

Keep in mind, though, that domain shift is not the only instance of sampling bias. The distribution might be perfectly stable but your sampling process itself could still be flawed.

#### Empirical risk minimization with a distorted sampling distribution

Consider again the statistical properties of the empirical risk estimator:

\begin{gather*} \mathbb{E}\left[\hat{R}\right]=\mathbb{E}_{p(x,y)}\left[L(y,\hat{M}(x))\right]\\ \mathbb{V}ar\left[\hat{R}\right]=\frac{1}{N}\mathbb{V}ar_{p(x,y)}\left[L(y,\hat{M}(x))\right] \end{gather*}

A crucial detail in these formulas is `p(x,y)`. Unless your samples come from the true data generating process, your risk estimate will be flawed. If our sampling distribution is different, say `q(x,y)`, there is no way to guarantee that we are optimizing for the correct risk anymore. This can be exemplified in a simple thought experiment:

#### A concrete example of a biased sampling process without domain shift

Imagine you had a camera in your garden and want to classify animals that are playing inside of it. Thus, you aim to build and train some convolutional neural network classifier.

Presume that there are four types of possible animals, **cats**, **dogs**, **rabbits** and **horses**. For simplicity, we also assume that each one is equally likely to appear in your garden:

Being a true enthusiast, you spend the next few days taking a lot of pictures of the respective animals. However, since you were primarily focusing on cats, the number of cat pictures collected turns out to be much larger. The distribution of pictures in your sample might look like this:

Now, we have a divergence between the distribution of animals in the garden and the distribution of animal images in the sample. Due to a biased sampling process, the chance of cats ending up in the training set is much larger. The sample was not taken fully at random:

To simplify things further, consider you had only two candidate computer vision models. Of course, in reality you usually have an infinite number of candidate models. For Neural Networks, for example, each possible parameter configuration is a separate candidate. Your search algorithm for the optimal model is typically gradient descent.

Anyway, presume that our two candidates have the following properties:

#### The effect of a biased sampling process

In order to select the best model out of the two, you apply empirical risk minimization. Let’s say you want to minimize a zero-one loss:

L(y,\hat{M}(x))=\begin{cases}0 & \text{if } \hat{M}(x)=y \\ 1 & \text{else}\end{cases}

Now we can calculate the expected empirical risk of candidate model 1 given the natural distribution:

\begin{gather*} \mathbb{E}_{p_{\text{natural}}}[\hat{R}(\hat{M}_1)]=0.25\cdot L(\text{'cat'},\hat{M}_1(\text{cat image}))+\cdots\\+0.25\cdot L(\text{'rabbit'},\hat{M}_1(\text{rabbit image}))\\ =0.25\cdot1+0.25\cdot0+0.25\cdot0+0.25\cdot0\\ =0.25 \end{gather*}

If we do this for both candidates and distributions, we obtain the following:

Clearly, candidate 1 is preferable once we want to use our model ‘in production’. Due to our biased sampling process, however, the inferior candidate model 2 appears more attractive. According to the above formulas, variance of the risk estimator decreases with larger samples. As a result, more data-points will actually increase the chance of selecting the wrong model.
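The full comparison can be written out in a few lines. Candidate 1 behaves as in the calculation above (it only misses cats); candidate 2’s per-class losses and the exact cat share of the biased sample are hypothetical stand-ins chosen to illustrate the same effect:

```python
# Zero-one losses per animal class: 1 = misclassified, 0 = correct.
# model_1 only misses cats (as in the text); model_2's behaviour is a
# hypothetical stand-in for the candidate table.
loss = {
    "model_1": {"cat": 1, "dog": 0, "rabbit": 0, "horse": 0},
    "model_2": {"cat": 0, "dog": 1, "rabbit": 1, "horse": 0},
}

# Natural distribution in the garden vs. the cat-heavy sampling distribution.
p_natural = {"cat": 0.25, "dog": 0.25, "rabbit": 0.25, "horse": 0.25}
q_biased = {"cat": 0.70, "dog": 0.10, "rabbit": 0.10, "horse": 0.10}

def expected_risk(model, dist):
    """Expected zero-one empirical risk of `model` under distribution `dist`."""
    return sum(dist[animal] * loss[model][animal] for animal in dist)

for model in loss:
    print(model,
          expected_risk(model, p_natural),   # what matters in production
          expected_risk(model, q_biased))    # what the biased sample suggests
```

Under these assumed numbers, model 1 has the lower natural risk (0.25 vs. 0.5) but the higher risk on the biased sample (0.7 vs. 0.2) – empirical risk minimization on the biased data picks the wrong model.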

In practice, things are, of course, much more complex. Even with mildly biased data, you might still end up with a sufficiently powerful model. On the other hand, you can never guarantee that your models won’t suffer from distributional shift over time. This makes the issue of sampling bias a permanent theme that you should always keep in mind.

#### Where you might encounter biased sampling

- **Data is acquired over time**: Again, distributional shift. This is true for practically every Machine Learning dataset. As long as the distributional shift is not too unstable, you can usually handle this in a reasonable manner.
- **Data is systematically altered or deleted**: This might be the case when you have users that are opting out via GDPR or related forms. If this process is not independent of your users’ characteristics, expect some bias in your sampling process.

#### How to mitigate or reduce the impact of biased sampling

- **Monitor your models as closely as possible**: If you are following MLOps best practices, you should already be familiar with this point.
- **Include the biasing variable in your model**: For example, accounting for the timestamp as a separate variable might mitigate the problem of domain shift over time. As long as the pattern of distributional change itself remains constant, this could be a viable solution. Keep in mind, however, that there is still no guarantee for the presumption of constant domain shift.
- **Monitor and optimize the sampling procedure**: If you can control the sampling process, ensure that it comes as close to ideal random sampling as possible.
- **Consider online learning**: In order to quickly adapt to a changing distribution, you should update your models as frequently as possible. As a matter of fact, online learning is the fastest way to do so. If you can afford the additional effort of updating models in real-time, you should give this idea a try.
- **Consider re-weighting or re-sampling**: If all of the above is not feasible, you could still try methods such as inverse probability weighting.
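The last point can be sketched with the garden example: if the sampling distribution `q` is known (an assumption – in practice it must be estimated), weighting each observation’s loss by `p/q` recovers an unbiased estimate of the risk under the natural distribution `p`:

```python
import numpy as np

rng = np.random.default_rng(5)

# Class order: cat, dog, rabbit, horse.
p_natural = np.array([0.25, 0.25, 0.25, 0.25])   # target distribution
q_biased = np.array([0.70, 0.10, 0.10, 0.10])    # how the data was actually sampled

# Draw a biased training sample; the candidate model misses only cats (loss 1).
sample = rng.choice(4, size=50_000, p=q_biased)
losses = (sample == 0).astype(float)             # class 0 = cat

# The naive empirical risk is biased towards q; importance weights p/q correct it.
naive_risk = losses.mean()
weights = (p_natural / q_biased)[sample]
weighted_risk = np.mean(weights * losses)

print(naive_risk)      # close to 0.70 - risk under the biased sample
print(weighted_risk)   # close to 0.25 - risk under the natural distribution
```

The price of re-weighting is a higher variance of the risk estimate, especially when `q` puts very little mass on classes that matter under `p` – which is why fixing the sampling process itself remains the preferred option.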

## Conclusion

If you have been reading up until here, you have probably realized that there is always room for improvement when it comes to data. While you cannot optimize single datasets all day long, their quality is, nevertheless, essential for effective Data Science and Machine Learning. Thus, if your model just doesn’t seem to improve, consider a closer look at the inputs.

## Image sources

- **Horse** – Photo by Helena Lopes on Unsplash
- **Dog** – Photo by Marliese Streefland on Unsplash
- **Cat** – Photo by Raoul Droog on Unsplash
- **Rabbit** – Photo by Gary Bendig on Unsplash
