Data products and the cost of iteration


Working with data products is great. You are giving users information that helps them make better decisions based on their own data, really unlocking its potential. That said, iterating on such products can be a very involved affair, and you can be left wondering whether it really has to be so slow. This post explores why the process can take so long, lists some potential approaches, and considers if and when it is worth the trouble.

The challenges of iteration

The hardest part of changing your data pipeline is not (necessarily) the code itself, but rather measuring the effect of the change. One minor tweak at the start of your processing pipeline can have cascading effects across your different outputs, and it can be a fool’s errand to try to predict those changes in advance. So how can you tackle this?

An obvious approach is to have test data that you can use to verify you obtain the expected result. This has the benefit of being relatively easy to set up and update, although depending on your product, generating the test data can be challenging. The production data is also likely to be less tidy than what is in your test battery, and you should not underestimate the complexity of the data your users may end up producing.

A battery of tests also cannot tell you how your customers’ data will change in the aggregate. For instance, let us assume you are changing how GPS coordinates map onto a country or region. The tests will be able to check whether a particular coordinate maps onto a particular country, but they won’t tell you how many data points in your customer’s account will actually change.
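As a rough sketch, a unit test along these lines can confirm the mapping for known inputs, but it says nothing about how many production records will shift. The map_to_country function here is a made-up stand-in for whatever lookup you are actually testing:

    import unittest

    # Hypothetical function under test: maps (latitude, longitude) to an ISO country code.
    # The crude bounding box is only a stand-in for the real mapping logic.
    def map_to_country(lat, lon):
        if 47.0 < lat < 55.0 and 6.0 < lon < 15.0:
            return "DE"
        return None

    class TestCountryMapping(unittest.TestCase):
        def test_berlin_maps_to_germany(self):
            # Central Berlin should resolve to Germany.
            self.assertEqual(map_to_country(52.52, 13.405), "DE")

        def test_open_ocean_maps_to_no_country(self):
            # A point in the open ocean should not be assigned to any country.
            self.assertIsNone(map_to_country(0.0, 0.0))

    if __name__ == "__main__":
        unittest.main()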

Nevertheless, this is a good starting point, even if it comes with limitations.

An approach I have seen work multiple times is to build a system that allows you to compare the data you produced with and without the change in place. Perhaps you write the output of the new way of processing into a separate database while maintaining the old output, and then compare the data in the two databases.

This is not a small amount of work. You will need infrastructure that can support a second such instance, and a process for comparing numbers in a quick and repeatable way. You will also need to think carefully about which numbers to compare with which, and how, and that will take a few iterations. On top of that, none of this comes for free: your infrastructure costs are likely to increase, and all of it takes time.
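As an illustration of what such a comparison might look like (the table, metric, and threshold below are invented for the sake of the example), you could run the same aggregate query against the current output and the shadow output and flag any metric that drifts beyond a tolerance:

    import sqlite3

    # A minimal sketch: run the same aggregate query against the current ("old")
    # and shadow ("new") outputs and flag metrics that drift beyond a threshold.
    THRESHOLD = 0.01  # flag anything that moves by more than 1%

    def load(rows):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE events (country TEXT, revenue REAL)")
        conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
        return conn

    old = load([("DE", 100.0), ("FR", 80.0), ("XX", 5.0)])  # current pipeline output
    new = load([("DE", 100.0), ("FR", 84.0)])               # output with the change applied

    QUERY = "SELECT country, SUM(revenue) FROM events GROUP BY country"

    old_totals = dict(old.execute(QUERY).fetchall())
    new_totals = dict(new.execute(QUERY).fetchall())

    for country in sorted(set(old_totals) | set(new_totals)):
        before = old_totals.get(country, 0.0)
        after = new_totals.get(country, 0.0)
        change = (after - before) / before if before else float("inf")
        marker = "  <-- review" if abs(change) > THRESHOLD else ""
        print(f"{country}: {before:.2f} -> {after:.2f} ({change:+.1%}){marker}")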

Nevertheless, being able to run such comparisons is akin to suddenly being able to glimpse into potential alternate futures. It largely does away with blindly deploying changes into production and hoping for the best. Instead, you can verify that your improvements provide the benefits they are supposed to, while making sure they do not cause unforeseen side effects. You will also be testing the functionality with data that is as realistic as possible.

A final point before we move on. The above approach is great for assessing changes in the underlying data, but it is less helpful for other kinds of improvements, such as changing how the data is queried. Given the huge number of combinations in which your data can be queried, it is likely not cost effective (if even possible) to test every one. However, you should look at which queries your users run most often, and test those.
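A crude way to find those queries (sketched here with a made-up in-memory query log; in practice this would come from your query logging) is to tally query shapes and replay only the most frequent ones against both the old and the new system:

    from collections import Counter

    # Tally the most frequent query shapes from a (hypothetical) query log,
    # then replay only those against both systems.
    query_log = [
        "SELECT country, COUNT(*) FROM events GROUP BY country",
        "SELECT device, COUNT(*) FROM events GROUP BY device",
        "SELECT country, COUNT(*) FROM events GROUP BY country",
    ]

    top_queries = [query for query, _ in Counter(query_log).most_common(10)]

    for query in top_queries:
        # In a real setup you would execute the query against both systems and diff the results.
        print(f"Would replay against both systems: {query}")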

No one likes (unexpected) change

Instagram offers a nice example of people reacting negatively to unexpected data changes. While it is not a company one would immediately associate with data products, Sarah Frier’s book No Filter describes how Instagram tackled the problem of fake accounts on the platform, which in turn was reflected in users’ follower counts. When the company removed the fake accounts in one go, users got upset, wondering why they had lost a considerable proportion of their followers overnight. The fact that they were left with better quality followers was no consolation, partly because the root cause was not immediately clear to them. Instagram then started removing fake accounts gradually, which ruffled fewer feathers, primarily because such changes are harder to notice. Whether or not one agrees that this was the best approach, it demonstrates that people do not tend to appreciate unexpected, large changes in their data, even when these result in more accurate numbers. How you conduct the roll-out of a change matters.

So you have set up your parallel data processing pipeline, and can clearly see the effect of any changes you plan to introduce. In most cases, the change in the data is minimal or non-existent, and you can release relatively quickly. Perhaps you release the change gradually across your users, just to make sure.

Then along comes that bug that has been there for a long time, and you see that the effect of the fix is not so insignificant after all. You have to introduce the change, but you can also see that it will leave some consumers annoyed and confused.

As we saw from the Instagram example, people don’t react well to unexpected data corrections, and they will often want some time to prepare for an upcoming change. Some crucial reports may need to be redone, or the changes will increase the workload for certain people, so staffing considerations come into play. Thus, you may be in a position where the best thing to do is to give your customers advance warning.

You do not want to be in a place where you need to coordinate release schedules with too many customers at the same time. A more scalable approach is to contact your customers and tell them the date on which they can expect some aspects of their data to change, along with an explanation as to why. If you can indicate in advance how much their data will change, you should do that too (“We are contacting you as we have identified that your account may see a drop of 10-20% in [Metric X]. This is due to [Reason Y].”).

You may also introduce a change via a new data setting, giving users more control over when a new functionality or improvement takes effect. This approach means you can have a lighter touch in terms of outreach, and less coordination will be necessary. However, you cannot introduce a new data setting for every change you make, especially since you don’t want to maintain the previous, now-obsolete behaviour indefinitely. You can also have the customer trigger the update through a one-off step, but that irreversibility can make the customer nervous and feel unsupported. These approaches should be used judiciously.
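As an illustration of the setting-based approach (the account IDs, setting name, and lookup functions below are all invented), a change can be gated on a per-account setting so that each account switches over on its own schedule:

    # Per-account settings gating a processing change; everything here is illustrative.
    ACCOUNT_SETTINGS = {
        "acme": {"use_new_country_mapping": True},   # has opted in to the new logic
        "globex": {},                                # still on the legacy behaviour
    }

    def legacy_country_lookup(lat, lon):
        return "??"  # stand-in for the old mapping logic

    def new_country_lookup(lat, lon):
        return "DE"  # stand-in for the improved mapping logic

    def map_country(account_id, lat, lon):
        settings = ACCOUNT_SETTINGS.get(account_id, {})
        if settings.get("use_new_country_mapping"):
            return new_country_lookup(lat, lon)
        return legacy_country_lookup(lat, lon)

    print(map_country("acme", 52.52, 13.405))    # new behaviour
    print(map_country("globex", 52.52, 13.405))  # legacy behaviour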

Now, maybe you focus only on fixing future data rather than fixing and reprocessing old data, and there is a strong case for prioritising that. In the Instagram example above, this would mean removing only ‘new’ fake accounts and leaving the legacy ones alone. While potentially the right call, it has an obvious downside: your numbers become harder to reconcile, since the processing of the data has suddenly changed. It can be difficult to explain and document why a particular input gave output X in January, but an equivalent input gave output Y in November. Or, to return to the Instagram example, it can be hard to explain why some followers are being removed while others are not. You can mitigate this by recording which version of the code and infrastructure was responsible for each computation, although surfacing that information to the end user remains challenging.
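One possible form of that record (the version string and field names here are made up) is to attach provenance metadata to every output row at the time it is computed:

    import datetime

    # Attach provenance metadata to each computed row, so that later on you can
    # explain which version of the pipeline produced a given number.
    PIPELINE_VERSION = "2024.11.2"  # e.g. a release tag or git SHA; the value here is invented

    def build_output_row(input_record, computed_value):
        return {
            "source_id": input_record["id"],
            "value": computed_value,
            "pipeline_version": PIPELINE_VERSION,
            "computed_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }

    print(build_output_row({"id": 42}, 3.14))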

It isn’t always worth it

I hope the content so far has made it clear why implementing certain changes to your data product can be a very onerous affair. Truth be told, though, all the effort isn’t always worth it.

For instance, if you can be confident that the change will not result in differences (e.g. you have merely sped up some queries), then extensive checks and customer outreach are unnecessary. Not quite as risk-free, but you may also know that the number of customers impacted will be low, because you are only changing a particular edge case. Even so, it is easy to underestimate the impact and introduce unforeseen consequences, so you really should have some basic tests in place and consider performing gradual rollouts.
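A common pattern for such gradual rollouts (sketched below with made-up account IDs; the percentage is arbitrary) is to hash each account into a stable bucket and only enable the change for buckets below the current rollout percentage:

    import hashlib

    ROLLOUT_PERCENT = 10  # start by exposing the change to roughly 10% of accounts

    def in_rollout(account_id: str, percent: int = ROLLOUT_PERCENT) -> bool:
        # Hash the account ID into a stable bucket from 0 to 99; the same account
        # always lands in the same bucket, so its experience does not flip-flop.
        bucket = int(hashlib.sha256(account_id.encode()).hexdigest(), 16) % 100
        return bucket < percent

    for account in ["acme", "globex", "initech"]:
        print(account, in_rollout(account))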

Finally, the truth is that if you don’t evolve your product quickly, the risk to the company may be greater than that of ruffling some customers’ feathers. The product might still be in its early days, and if you don’t move fast, you will lose the few users you have, or run out of cash to keep the company alive. Also, some issues are so egregious that they really need to be fixed immediately.

However, as your company grows, your need for testing and smooth rollout procedures is only going to increase. These will ease your customer relationships, and if you are in a regulated industry, this sort of capability will need to be in place when auditors come knocking.

Summa summarum

Iterating on your data product is a resource intensive affair. You will need to develop multiple ways of testing changes, and this will lengthen your development life cycle. As the company grows, you have to invest in automating the testing and rollout procedures so as to decrease the operational overhead. The operative word is ‘decrease’, as it will never be reduced to zero, though you can get close. However, when data is your product and you want to reach the best possible quality, the investment is worth it.


