
Best within years, wrong across years: lessons from a global poverty imputation challenge

Image: A global imputation challenge shows that selecting the right model for poverty estimation can be as difficult as building the model itself. (Shutterstock)

Policy makers need up-to-date poverty data to guide budgets, target programs, and assess results. Yet in many countries, the latest household survey with consumption data is several years old. Newer surveys may exist, but without consumption data, poverty cannot be measured directly. As a result, key decisions are often made without current information on living standards—including whether labor markets are delivering better jobs and rising earnings.

A common solution is survey-to-survey (S2S) imputation. Analysts estimate how household characteristics relate to welfare using an older survey, then apply that model to a newer survey where consumption is missing. This allows faster and lower-cost monitoring. The risk is that the relationship between characteristics and welfare may have changed between surveys.
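In code, the core of the approach is short. Below is a minimal sketch in Python, assuming hypothetical survey files, column names, and a placeholder poverty line; actual S2S applications use more careful model specifications and account for estimation uncertainty.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file and column names, stand-ins for harmonized surveys.
features = ["hh_size", "educ_head", "employed_head", "urban"]
old = pd.read_csv("survey_old.csv")   # consumption observed
new = pd.read_csv("survey_new.csv")   # consumption missing

# Step 1: learn how household characteristics map to welfare in the old survey.
model = LinearRegression().fit(old[features], old["log_consumption"])

# Step 2: apply the model to the newer survey to impute welfare.
new["log_consumption_hat"] = model.predict(new[features])

# Step 3: estimate the headcount against a poverty line (illustrative value).
POVERTY_LINE = 7.0  # log consumption per capita; placeholder only
print("Imputed poverty rate:",
      (new["log_consumption_hat"] < POVERTY_LINE).mean())
```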

In practice, the real challenge is not just prediction accuracy but choosing the right model when the true poverty rate is unknown and decisions must rely on partial validation and judgment.

To examine this challenge, the World Bank partnered with DrivenData to run a Survey-to-Survey Imputation Challenge. Participants trained models on three surveys with observed consumption and then predicted poverty in three newer surveys where consumption was deliberately withheld.

To replicate real-world uncertainty, evaluation data were split into two parts. Participants received feedback only on part of the data, the validation set, while their performance on the test set was kept hidden until final rankings were determined. Participants therefore had to select one final model without knowing its performance on the test set, just as policy teams must choose an imputation model without observing the true poverty rate. Crucially, unlike standard machine learning practice, where validation and test data share the same data-generating process, here the validation data came from a pre-COVID period (2015–2018) while the test data came from 2022, after COVID had potentially reshaped economic relationships.

After the competition ended, predictions were compared with actual poverty rates at poverty lines set at 20 percentiles of the consumption distribution, assessing both prediction accuracy and model selection under uncertainty.
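To make the evaluation concept concrete, here is a small Python sketch that compares predicted and actual headcount ratios at poverty lines placed at percentiles of the true consumption distribution. The percentile grid and the simulated data are assumptions for illustration, not the challenge's exact scoring rule.

```python
import numpy as np

def headcounts(consumption, lines):
    """Poverty headcount ratio at each candidate poverty line."""
    c = np.asarray(consumption)
    return np.array([(c < z).mean() for z in lines])

def percentile_bias(true_cons, pred_cons, n_lines=20):
    # Lines at evenly spaced percentiles of the true distribution;
    # the exact grid used in the challenge is an assumption here.
    qs = np.linspace(0.05, 1.00, n_lines)
    lines = np.quantile(true_cons, qs)
    return headcounts(pred_cons, lines) - headcounts(true_cons, lines)

# Toy data: noisy, slightly upward-biased predictions.
rng = np.random.default_rng(0)
true_c = rng.lognormal(0.0, 0.5, 5_000)
pred_c = true_c * rng.lognormal(0.05, 0.2, 5_000)
print(percentile_bias(true_c, pred_c).round(3))
```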

 

The most interesting finding: choosing the right model is as hard as building it

Participants could submit multiple models but had to choose one final submission. In practice, this decision proved difficult: nearly half of the participants failed to choose their strongest model, even though they had already built it.

Figure 1 illustrates this by plotting, for each participant, the hypothetical rank they would have achieved with their best-performing submission against the actual rank based on the submission they chose. For some participants, the gap was substantial. Selection risk was evident even at the top: of the participants who could have ranked in the top 10, only one did after choosing their preferred model. In real-world settings, similar misselection could mean publishing an imputed poverty trend that appears credible but deviates meaningfully from reality.

 

Figure 1. Chosen result vs best result (with 45° line)


Note: Each point is a participant. The x-axis is the rank they could have achieved using their best submission on the test set; the y-axis is the rank based on the submission they chose. Points above the 45° line indicate participants who did not select their best-performing submission.

In-sample validation does not guarantee optimal out-of-sample performance

Validation set results were informative: teams that performed well on the validation set often performed well overall. However, small differences in validation performance frequently disappeared or reversed on the test set. A model that appeared slightly better during the competition sometimes performed worse when applied to the test set.

Figure 2 shows this pattern. If validation results perfectly predicted final outcomes, all points would lie on the 45-degree line. Instead, many deviate substantially—some who ranked near the top on the validation set finished much lower, while others with modest ranks ultimately performed very well.

Partial feedback can narrow the field, but it cannot reliably identify the most accurate model. Small validation differences should not be overinterpreted, especially when, as in cross-year imputation, the validation and test data do not share the same data-generating process. In that setting, a model that ranks well on validation data may perform poorly once the underlying economic relationships have shifted.
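A toy simulation helps illustrate the point. In the Python sketch below, hypothetical models are scored on a validation set and on a test set whose relationship to true model quality has weakened; the 0.6 carry-over factor is an arbitrary assumption standing in for structural change, not an estimate from the challenge.

```python
import numpy as np
from scipy.stats import spearmanr

# 40 hypothetical models with a latent "true" quality.
rng = np.random.default_rng(0)
skill = rng.normal(size=40)                     # genuine model quality
val = skill + 0.5 * rng.normal(size=40)         # validation score
test = 0.6 * skill + 1.0 * rng.normal(size=40)  # shifted test score

# Ranks correlate, but the validation winner is rarely the test winner.
rho, _ = spearmanr(val, test)
print(f"Spearman correlation of validation and test scores: {rho:.2f}")
print("Validation-best model is also test-best:",
      np.argmax(val) == np.argmax(test))
```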

 

Figure 2. Validation rank vs final rank scatter (with 45° line)


Note: Each point is a participant. The x-axis shows rank by validation performance; the y-axis shows final rank based on the chosen submission. Dispersion around the 45° line illustrates why validation feedback is informative but not decisive for selecting winners.

Predicting poverty during a shock period was very difficult

Rankings show relative performance—but the key question is how close predictions came to the truth. Figure 3 compares predicted poverty changes with the true change in the test data.

In this setting, models built with standard household covariates, such as education, housing, and employment, often implied falling poverty when poverty had actually risen. To see why: if employment rates held steady post-COVID but real wages collapsed, a model relying on employment status as a welfare proxy would predict stability rather than deterioration. Variables that directly capture welfare shocks can partially mitigate this risk, but they are rarely available in data-deprived contexts. Under structural change, models trained on older data can become systematically wrong, not just noisier. To protect the integrity of the competition, participants were also not told which country the data came from, which prevented prior knowledge of national poverty trends from shaping predictions.
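A stylized simulation illustrates the mechanism. In the Python sketch below, true welfare depends on wages, but the imputation model only observes employment and education, as in a typical consumption-free survey. When wages fall while employment holds, the model's inputs do not change, so the imputed poverty rate barely moves. Every parameter is an illustrative assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20_000
employed = rng.binomial(1, 0.7, n)
educ = rng.normal(10, 3, n)
wage = employed * np.exp(0.1 * educ + rng.normal(0, 0.3, n))

def welfare(w):
    # True welfare is driven by wages, which the model never observes.
    return 0.5 + 0.8 * np.log1p(w) + rng.normal(0, 0.2, len(w))

X = np.column_stack([employed, educ])
y_pre = welfare(wage)                       # pre-shock training data
model = LinearRegression().fit(X, y_pre)

# Shock: employment holds steady, real wages fall by 40 percent.
y_post = welfare(0.6 * wage)

# Impute post-shock welfare; add a residual draw, as S2S methods do,
# so the imputed distribution has a realistic spread.
resid_sd = (y_pre - model.predict(X)).std()
y_hat = model.predict(X) + rng.normal(0, resid_sd, n)

line = np.quantile(y_pre, 0.40)             # illustrative poverty line
print(f"True poverty, pre-shock:     {(y_pre  < line).mean():.1%}")
print(f"True poverty, post-shock:    {(y_post < line).mean():.1%}")  # rises
print(f"Imputed poverty, post-shock: {(y_hat  < line).mean():.1%}")  # ~flat
```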

 

Figure 3. Bias in predicted poverty changes by poverty-line percentiles for 2022


Note: Boxplots summarize bias (in percentage points) across participants' submitted predictions.

Why this matters for real-time poverty monitoring

This challenge was not about rankings; the purpose was to better understand the risks involved when producing poverty estimates based on S2S imputation. When poverty predictions are inaccurate, they can distort operational decisions: which groups are prioritized, whether policy dialogue focuses on job quality or quantity, and how quickly to scale labor-market and social protection responses. A model that “looks good” on partial validation can still produce a plausible—but wrong—story about whether jobs and earnings are improving for the poor.

The results reinforce a central lesson: imputation works well when relationships are stable, but structural change raises prediction error and model-selection risk at the same time. This aligns with our earlier work suggesting teams should treat structural change as the default risk rather than the exception. During shocks, model selection cannot rely on validation-set ranking alone; it should also consider economic intuition, country context, robustness across specifications, and consistency with macroeconomic information such as GDP and labor market trends. Sensitivity checks and transparent communication of caveats are equally essential.
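As one example of what a robustness check across specifications might look like, the toy Python sketch below imputes poverty under several hypothetical specifications and compares the results. Note the caveat the simulation builds in: the specifications agree closely with one another, yet all miss a welfare shift that is invisible to the covariates.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for an older (labeled) and newer survey; all
# variables, coefficients, and the 20 percent welfare drop are assumptions.
rng = np.random.default_rng(0)

def make_survey(n, welfare_shift=1.0):
    df = pd.DataFrame({
        "hh_size": rng.integers(1, 9, n),
        "educ": rng.normal(10, 3, n),
        "urban": rng.binomial(1, 0.4, n),
    })
    df["log_cons"] = (0.3 - 0.05 * df["hh_size"] + 0.06 * df["educ"]
                      + 0.2 * df["urban"] + np.log(welfare_shift)
                      + rng.normal(0, 0.3, n))
    return df

old = make_survey(5_000)
new = make_survey(5_000, welfare_shift=0.8)  # shock invisible to covariates
line = old["log_cons"].quantile(0.30)

# Impute poverty in the new survey under several specifications.
specs = {
    "ols_small": (LinearRegression(), ["educ", "urban"]),
    "ols_full":  (LinearRegression(), ["hh_size", "educ", "urban"]),
    "rf_full":   (RandomForestRegressor(n_estimators=200, random_state=0),
                  ["hh_size", "educ", "urban"]),
}
for name, (m, cols) in specs.items():
    m.fit(old[cols], old["log_cons"])
    rate = (m.predict(new[cols]) < line).mean()
    print(f"{name:9s} imputed poverty: {rate:.1%}")

# The specifications agree, yet all miss the true deterioration:
print(f"true poverty rate in new survey: {(new['log_cons'] < line).mean():.1%}")
```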

Ultimately, improving algorithms matters—but improving how models are selected and how uncertainty is communicated may be even more important for responsible real-time poverty monitoring, and for credible tracking of poverty reduction.


Jaime Fernandez

Data Scientist, Development Data Group, World Bank

Paul Corral

Senior Economist, Office of the Human Development Practice Group Chief Economist, World Bank

Maria Eugenia Genoni

Senior Economist, Global Lead on Data Systems and Statistics Operations, Poverty and Equity Global Practice, World Bank

Andrés Ham

Associate Professor in the School of Government at Universidad de los Andes

Leonardo Lucchetti

Senior Economist, Poverty and Equity Global Practice, World Bank

Henry Stemmler

E T Consultant, Poverty and Equity Global Practice, World Bank

Peter Lanjouw

Professor of Development Economics at the University of Amsterdam

Kimberly Bolch

Economist, Poverty Global Practice, World Bank

Juliana Soares

Consultant, Poverty Global Department, World Bank
