Scale Experiment Decision-Making with Programmatic Decision Rules

Decide what to do with experiment results in code


The experiment lifecycle is like the human lifecycle. First, a person or idea is born, then it develops, then it is tested, then its test ends, and then the Gods (or Product Managers) decide its worth.

But a lot of things happen during a life or an experiment. Sometimes, a person or idea is good in one way but bad in another. How are the Gods supposed to decide? They have to make some tradeoffs. There’s no avoiding it.

The key is to make these tradeoffs before the experiment and before we see the results. We do not want to decide on the rules based on our pre-existing biases about which ideas deserve to go to heaven (err… launch — I think I’ve stretched the metaphor far enough). We want to write our scripture (okay, one more) before the experiment starts.

The point of this blog is to propose that we should write down how we will make decisions explicitly — not in English, which permits vague language, e.g., “we’ll consider the effect on engagement as well, balancing against revenue” and similar wishy-washy, unquantified statements — but in code.

I’m proposing an “Analysis Contract,” which enforces how we will make decisions.

A contract is a function in your favorite programming language. The contract takes the “basic results” of an experiment as arguments. Determining which basic results matter for decision-making is part of defining the contract. Usually, the basic results are treatment effects, the standard errors of those treatment effects, and configuration parameters like the number of peeks. Given these results, the contract returns the arm, or variant, of the experiment that will launch. For example, it would return either ‘A’ or ‘B’ in a standard A/B test.

It might look something like this:

int
analysis_contract(double te1, double te1_se, ...)
{
    if ((te1 / te1_se < 1.96) && (...conditions...))
        return 0;   /* for variant 0 */
    if (...conditions...)
        return 1;   /* for variant 1 */

    /* and so on */
}

The Experimentation Platform would then associate the contract with the particular experiment. When the experiment ends, the platform processes the contract and ships the winning variant according to the rules specified in the contract.
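To make that concrete, here is a rough sketch of how a platform could wire a contract to an experiment. Everything here (the type names, the ship_variant hook, the example experiment and numbers) is invented for illustration, not an existing platform API:

#include <stdio.h>

/* Hypothetical sketch: none of these names come from a real platform. */
typedef int (*contract_fn)(const double *effects, const double *std_errors,
                           int n_metrics);

typedef struct {
    const char *experiment_id;
    contract_fn contract;   /* registered before the experiment starts */
} experiment;

/* Stand-in for whatever actually flips the feature flag. */
static void ship_variant(const char *experiment_id, int variant)
{
    printf("experiment %s: shipping variant %d\n", experiment_id, variant);
}

/* Example contract: ship variant 1 only if the primary metric is
 * significantly positive; otherwise keep variant 0. */
static int primary_metric_contract(const double *effects,
                                   const double *std_errors, int n_metrics)
{
    (void)n_metrics;
    if (effects[0] / std_errors[0] > 1.96)
        return 1;
    return 0;
}

/* The platform, not a human, evaluates the pre-registered rule at the end. */
static void finalize_experiment(const experiment *exp, const double *effects,
                                const double *std_errors, int n_metrics)
{
    ship_variant(exp->experiment_id,
                 exp->contract(effects, std_errors, n_metrics));
}

int main(void)
{
    experiment exp = { "checkout-redesign", primary_metric_contract };
    double effects[]    = { 0.031 };   /* illustrative final results */
    double std_errors[] = { 0.010 };
    finalize_experiment(&exp, effects, std_errors, 1);
    return 0;
}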

I’ll add the caveat here that this is an idea. It’s not a story about a technique I’ve seen implemented in practice, so there may be practical issues with various details that would be ironed out in a real-world deployment. I think Analysis Contracts would mitigate the problem of ad-hoc decision-making and force us to think deeply about and pre-register how we will deal with the most common scenario in experimentation: metrics we expected to move a lot coming back statistically insignificant.

By using Analysis Contracts, we can…

Make decisions upfront

We do not want to change how we make decisions because of the particular dataset our experiment happened to generate.

There’s no (good) reason why we should wait until after the experiment to say whether we would ship in Scenario X. We should be able to say it before the experiment. If we are unwilling to, it suggests that we are relying on something else outside the data and the experiment results. That information might be useful, but information that doesn’t depend on the experiment results was available before the experiment. Why didn’t we commit to using it then?

Statistical inference is based on a model of behavior. In that model, we know exactly how we would make decisions — if only we knew certain parameters. We gather data to estimate those parameters and then decide what to do based on our estimates. Not specifying our decision function breaks this model, and many of the statistical properties we take for granted are just not true if we change how we call an experiment based on the data we see.

We might say: “We promise not to make decisions this way.” But then, after the experiment, the results aren’t very clear. A lot of things are insignificant. So, we cut the data in a million ways, find a few “significant” results, and tell a story from them. It’s hard to keep our promises.
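To see why this is dangerous, a back-of-the-envelope check (not from the original post, and assuming the metrics are independent): if we test many metrics whose true effect is zero and treat anything that clears 1.96 as a reason to ship, the chance of at least one spurious “win” grows quickly with the number of metrics.

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* 10 independent null metrics, each tested at the two-sided 5% level. */
    const int n_metrics = 10;
    const double alpha = 0.05;   /* per-metric false-positive rate */

    /* Probability that at least one metric looks "significant" by chance:
     * 1 - 0.95^10, roughly 40%, even though nothing is really there. */
    double p_any = 1.0 - pow(1.0 - alpha, n_metrics);
    printf("P(at least one 'significant' null metric) = %.3f\n", p_any);
    return 0;
}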

The cure isn’t to make a promise we can’t keep. The cure is to make a promise the system won’t let us (quietly) break.

Be consistent, clear, and precise about how we make decisions

English is a vague language, and writing our guidelines in it leaves a lot of room for interpretation. Code forces us to decide explicitly and quantitatively what we will do: for example, how much revenue we are willing to give up in the short run to improve our subscription product in the long run.
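As a concrete illustration of that kind of quantified trade-off, a contract clause might say: ship the new subscription flow only if subscriptions are up significantly and by at least 2%, and the estimated revenue loss is no more than 0.5%. The function and thresholds below are made up for the sketch:

/* Hypothetical trade-off contract; the thresholds are illustrative, not a
 * recommendation. Effects are relative lifts versus control. */
int subscription_tradeoff_contract(double revenue_effect, double revenue_se,
                                   double subs_effect, double subs_se)
{
    /* Ship only if subscriptions are up significantly and by at least 2%,
     * and the revenue point estimate gives up no more than 0.5%. */
    int subs_clearly_up     = (subs_effect / subs_se) > 1.96 && subs_effect >= 0.02;
    int revenue_loss_capped = revenue_effect >= -0.005;

    (void)revenue_se;   /* a stricter rule could also bound the revenue t-stat */

    if (subs_clearly_up && revenue_loss_capped)
        return 1;   /* ship the new subscription experience */
    return 0;       /* keep control */
}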

Code improves communication enormously because I don’t have to interpret what you mean. I can plug in different results and see what decisions you would have made if the results had differed. This can be incredibly useful for retrospective analysis of past experiments as well. Because we have an actual function mapping results to decisions, we can run various simulations, bootstraps, etc., and re-decide the experiment based on that data.
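For example, here is a minimal sketch (with made-up numbers and a placeholder contract) of rerunning a contract over simulated draws of a treatment-effect estimate to see how often each variant would have shipped:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Placeholder contract: ship variant 1 if the effect is significantly positive. */
static int contract(double te, double te_se)
{
    return (te / te_se > 1.96) ? 1 : 0;
}

/* Standard normal draw via Box-Muller. */
static double std_normal(void)
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * 3.141592653589793 * u2);
}

int main(void)
{
    /* Illustrative point estimate and standard error from a past experiment. */
    const double te_hat = 0.021, te_se = 0.010;
    const int n_draws = 10000;
    int ship_counts[2] = { 0, 0 };

    srand(7);
    for (int i = 0; i < n_draws; i++) {
        /* Parametric "bootstrap" draw of the estimate around te_hat. */
        double te_draw = te_hat + te_se * std_normal();
        ship_counts[contract(te_draw, te_se)]++;
    }
    printf("variant 0 would ship in %.1f%% of draws\n", 100.0 * ship_counts[0] / n_draws);
    printf("variant 1 would ship in %.1f%% of draws\n", 100.0 * ship_counts[1] / n_draws);
    return 0;
}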

But what if I disagree with the Analysis Contract’s decision?

One of the primary objections to Analysis Contracts is that after the experiment, we might decide we had the wrong decision function. Usually, the problem is that we didn’t realize what the experiment would do to metric Y, and our contract ignores it.

Given that, there are two roads to go down:

  1. If we have 1000 metrics and the true effect of an experiment on each metric is 0, some metrics will likely show large estimated effects just by chance. One solution is to go with the Analysis Contract this time and remember to consider the metric in the contract next time. Over time, our contract will evolve to better represent our true goals. We shouldn’t put too much weight on what happens to the 20th most important metric. It could just be noise.
  2. If the effect is truly outsized and we can’t get comfortable with ignoring it, the other solution is to override the contract, making sure to log somewhere prominent that this happened. Then, update the contract, because we clearly care a lot about this metric. Over time, the number of times we override should be tracked as a KPI of our experimentation system (a minimal sketch of what that logging could look like follows this list). As we get the decision-making function closer and closer to the best representation of our values, we should stop overriding. This is a good way to monitor how much ad-hoc, nonstatistical decision-making goes on. If we frequently override the contract, then we know the contract doesn’t mean much and we are not following good statistical practices. It’s built-in accountability, and it creates a cost to overriding the contract.
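A minimal sketch of what that override log and KPI could look like (the record fields and the example experiments are invented for illustration):

#include <stdio.h>

/* Hypothetical override log entry; field names are illustrative. */
typedef struct {
    const char *experiment_id;
    int contract_decision;   /* variant the contract chose */
    int human_decision;      /* variant actually shipped */
    const char *reason;      /* why the contract was overridden */
} decision_record;

/* Override rate across past experiments: the KPI discussed above. */
static double override_rate(const decision_record *records, int n)
{
    int overrides = 0;
    for (int i = 0; i < n; i++)
        if (records[i].contract_decision != records[i].human_decision)
            overrides++;
    return n > 0 ? (double)overrides / n : 0.0;
}

int main(void)
{
    decision_record records[] = {
        { "exp-101", 0, 0, "" },
        { "exp-102", 1, 1, "" },
        { "exp-103", 0, 1, "outsized effect on support tickets" },
    };
    printf("override rate: %.2f\n", override_rate(records, 3));
    return 0;
}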

Contracts as Predicates

Contracts do not need to be fully flexible code (there are probably security issues with allowing arbitrary code to be submitted directly to an Experimentation Platform, even if it’s conceptually nice). But we can have a system that enables experimenters to specify predicates, i.e., IF TStat(Revenue) ≤ 1.96 AND TStat(Engagement) > 1.96 THEN X, etc. We can expose standard comparison operations alongside t-stats and effect magnitudes and specify decisions that way.
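One way such a predicate system could be represented (again, just a sketch; the types and the hard-coded results are invented for illustration) is as data rather than arbitrary code: each rule is a list of comparisons against named statistics, and the platform evaluates the list against the final results.

#include <stdio.h>
#include <string.h>

/* Hypothetical predicate representation: rules as data, not arbitrary code. */
typedef enum { LE, GT } comparator;

typedef struct {
    const char *metric;    /* e.g., "Revenue" */
    const char *stat;      /* "tstat" or "effect" */
    comparator cmp;
    double threshold;
} predicate;

typedef struct {
    const predicate *preds;
    int n_preds;
    int variant_if_true;   /* variant to ship if all predicates hold */
} rule;

/* Result lookup the platform would provide; hard-coded here for illustration. */
static double lookup(const char *metric, const char *stat)
{
    if (strcmp(metric, "Revenue") == 0)    return strcmp(stat, "tstat") == 0 ? 1.2 : 0.001;
    if (strcmp(metric, "Engagement") == 0) return strcmp(stat, "tstat") == 0 ? 2.8 : 0.030;
    return 0.0;
}

static int rule_fires(const rule *r)
{
    for (int i = 0; i < r->n_preds; i++) {
        double value = lookup(r->preds[i].metric, r->preds[i].stat);
        int ok = (r->preds[i].cmp == LE) ? value <= r->preds[i].threshold
                                         : value >  r->preds[i].threshold;
        if (!ok)
            return 0;
    }
    return 1;
}

int main(void)
{
    /* IF TStat(Revenue) <= 1.96 AND TStat(Engagement) > 1.96 THEN ship variant 1 */
    const predicate preds[] = {
        { "Revenue",    "tstat", LE, 1.96 },
        { "Engagement", "tstat", GT, 1.96 },
    };
    const rule r = { preds, 2, 1 };
    if (rule_fires(&r))
        printf("ship variant %d\n", r.variant_if_true);
    else
        printf("keep control\n");
    return 0;
}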

Thanks for reading! Does your org use anything similar to an Analysis Contract? I think it’s a great solution to a tricky human problem in experimentation, but I’d love to hear anyone’s real-world experience with a more automated approach to experiment decision-making.

Zach

Connect at LinkedIn: https://linkedin.com/in/zlflynn

