
Introduction

This vignette demonstrates how to use the mwana package’s functions to estimate the prevalence of wasting. The package allows users to estimate prevalence based on:

  • The weight-for-height z-score (WFHZ) and/or edema;
  • The raw MUAC values and/or edema;
  • The combined prevalence; and
  • The MUAC-for-age z-score (MFAZ) and/or edema.
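These case-definitions follow the standard WHO cut-offs: severe wasting (SAM) below -3 z-scores or with edema, moderate wasting (MAM) from -3 to below -2 z-scores, and global wasting (GAM) as the two combined. A minimal base-R sketch of the WFHZ-based definition (classify_wasting() is a hypothetical helper for illustration, not a mwana function):

```r
# Hypothetical helper illustrating the WFHZ-based case-definition;
# mwana implements this internally.
classify_wasting <- function(wfhz, edema = "n") {
  ifelse(
    edema == "y" | wfhz < -3, "sam",          # severe: z < -3 or edema
    ifelse(wfhz < -2, "mam", "not wasted")    # moderate: -3 <= z < -2
  )
}

classify_wasting(wfhz = c(-3.2, -2.5, -0.4), edema = c("n", "n", "y"))
#> [1] "sam" "mam" "sam"
```

GAM is then simply any observation classified as either "sam" or "mam".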

The prevalence functions in mwana were carefully designed to simplify the workflow of a nutrition data analyst, especially when dealing with datasets containing imperfections that require additional layers of analysis. Let’s clarify this with two scenarios that will likely remind you of the complexity involved:

  • When analysing a multi-area dataset, users will likely need to estimate the prevalence for each area individually. Afterward, they must extract the results and collate them into a summary table.

  • When working with MUAC data, if the age ratio test is rated as problematic, an additional tool is required to weight the prevalence and correct for age bias. In unfortunate cases where multiple areas face this issue, the workflow must be repeated several times, making the process cumbersome and highly error-prone 😬.

With mwana you no longer have to worry about this 🥳, as the functions are designed to deal with it. To demonstrate their use, we will use different datasets containing some of the imperfections alluded to above:

  • anthro.02 : survey data with survey weights. Read more about this dataset with ?anthro.02.
  • anthro.03 : district-level SMART surveys with two districts whose WFHZ standard deviation is rated as problematic, while the rest are within range. Run ?anthro.03 for more details.
  • anthro.04 : a community-based sentinel site data. The data has different characteristics that require different analysis approaches.

Now we can begin delving into each function.

Estimation of the prevalence of wasting based on WFHZ

To estimate the prevalence of wasting based on WFHZ we use the compute_wfhz_prevalence() function. The dataset to supply must have been wrangled by mw_wrangle_wfhz().

As usual, we start off by inspecting our dataset:

tail(anthro.02)
# A tibble: 6 × 14
  province strata cluster   sex   age weight height edema  muac wtfactor   wfhz
  <chr>    <chr>    <int> <dbl> <dbl>  <dbl>  <dbl> <chr> <dbl>    <dbl>  <dbl>
1 Nampula  Urban      285     1  59.5   13.8   90.7 n       149     487.  0.689
2 Nampula  Rural      234     1  59.5   17.2  105.  n       193    1045.  0.178
3 Nampula  Rural      263     1  59.6   18.4  100   n       156     952.  2.13
4 Nampula  Rural      257     1  59.7   15.9  100.  n       149     987.  0.353
5 Nampula  Rural      239     1  59.8   12.5   91.5 n       135     663. -0.722
6 Nampula  Rural      263     1  60.0   14.3   93.8 n       142     952.  0.463
# ℹ 3 more variables: flag_wfhz <dbl>, mfaz <dbl>, flag_mfaz <dbl>

We can see that the dataset contains the required variables for a WFHZ prevalence analysis, including those needed for a weighted analysis. This dataset has already been wrangled, so we do not need to call the WFHZ wrangler in this case. We will begin the demonstration with an unweighted analysis - typical of SMART surveys - and then proceed to a weighted analysis.

Estimation of unweighted prevalence

To achieve this we do:

anthro.02 |>
  compute_wfhz_prevalence(
    .wt = NULL,
    .edema = edema,
    .summary_by = NULL
  )

This will return:

# A tibble: 1 × 16
  gam_n  gam_p gam_p_low gam_p_upp gam_p_deff sam_n   sam_p sam_p_low sam_p_upp
  <dbl>  <dbl>     <dbl>     <dbl>      <dbl> <dbl>   <dbl>     <dbl>     <dbl>
1   121 0.0408    0.0322    0.0494        Inf    43 0.00664   0.00273    0.0106
# ℹ 7 more variables: sam_p_deff <dbl>, mam_n <dbl>, mam_p <dbl>,
#   mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>, wt_pop <dbl>

If for some reason the edema variable is not available in the dataset, or it is present but not plausible, we can exclude it from the analysis by setting the .edema argument to NULL:

anthro.02 |>
  compute_wfhz_prevalence(
    .wt = NULL,
    .edema = NULL, # Setting .edema to NULL
    .summary_by = NULL
  )

And we get:

# A tibble: 1 × 16
  gam_n  gam_p gam_p_low gam_p_upp gam_p_deff sam_n sam_p sam_p_low sam_p_upp
  <dbl>  <dbl>     <dbl>     <dbl>      <dbl> <dbl> <dbl>     <dbl>     <dbl>
1   107 0.0342    0.0263    0.0420        Inf    29     0         0         0
# ℹ 7 more variables: sam_p_deff <dbl>, mam_n <dbl>, mam_p <dbl>,
#   mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>, wt_pop <dbl>

If we inspect the gam_n and gam_p columns of this output table and the previous one, we notice differences in the numbers. This occurs because edema cases were excluded in the second implementation. Note that you will only observe a change if there are positive cases of edema in the dataset; otherwise, setting .edema = NULL will have no effect whatsoever.

The above output summary does not show results by province. We can control that using the .summary_by argument. In the above examples, it was set to NULL; now let’s pass the name of the column containing the locations where the data was collected. In our case, the column is province:

anthro.02 |>
  compute_wfhz_prevalence(
    .wt = NULL,
    .edema = edema,
    .summary_by = province # province is the variable's name holding data on where the survey was conducted.
  )

And voilà:

# A tibble: 2 × 17
  province gam_n  gam_p gam_p_low gam_p_upp gam_p_deff sam_n   sam_p sam_p_low
  <chr>    <dbl>  <dbl>     <dbl>     <dbl>      <dbl> <dbl>   <dbl>     <dbl>
1 Zambezia    41 0.0290    0.0195    0.0384        Inf    10 0.00351 0.0000639
2 Nampula     80 0.0546    0.0397    0.0695        Inf    33 0.0103  0.00282
# ℹ 8 more variables: sam_p_upp <dbl>, sam_p_deff <dbl>, mam_n <dbl>,
#   mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>,
#   wt_pop <dbl>

A table with two rows is returned with each province’s statistics.

Estimation of weighted prevalence

To get the weighted prevalence, we make use of the .wt argument. We pass to it the name of the column containing the final survey weights; in our case, wtfactor. We pass it in quotation marks " ":

anthro.02 |>
  compute_wfhz_prevalence(
    .wt = "wtfactor", # Passing the wtfactor to .wt
    .edema = edema,
    .summary_by = province
  )

And you get:

# A tibble: 2 × 17
  province gam_n  gam_p gam_p_low gam_p_upp gam_p_deff sam_n   sam_p sam_p_low
  <chr>    <dbl>  <dbl>     <dbl>     <dbl>      <dbl> <dbl>   <dbl>     <dbl>
1 Zambezia    41 0.0261    0.0161    0.0361       1.16    10 0.00236 -0.000255
2 Nampula     80 0.0595    0.0410    0.0779       1.52    33 0.0129   0.00272
# ℹ 8 more variables: sam_p_upp <dbl>, sam_p_deff <dbl>, mam_n <dbl>,
#   mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>,
#   wt_pop <dbl>

Under the hood of compute_wfhz_prevalence()

Under the hood, before starting the prevalence estimation, the function first checks the quality of the WFHZ standard deviation. If it is not rated as problematic, it proceeds with a complex sample-based analysis; otherwise, the prevalence is estimated using the PROBIT method, just as you see in the body of the plausibility report generated by ENA. The anthro.02 dataset has no such issues, so you don’t see compute_wfhz_prevalence() in action in this regard. To see that, let’s use the anthro.03 dataset.

anthro.03 contains problematic WFHZ standard deviations in the Metuge and Maravia districts, while the remaining districts are within range.
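As a refresher, the PROBIT approach does not count cases; it reads the proportion below the cut-off off the cumulative normal distribution. A minimal sketch of the idea (probit_prevalence() is a hypothetical illustration, not mwana’s internal code; fixing the SD at 1 when the observed SD is problematic follows common SMART practice, as seen in ENA’s plausibility report):

```r
# Hypothetical sketch of the PROBIT idea, not mwana's internal code:
# read the proportion below a z-score cut-off off the normal CDF.
probit_prevalence <- function(zscores, cutoff = -2, sd = 1) {
  # The mean comes from the sample; the SD is fixed at 1 here, as is
  # commonly done when the observed SD is rated as problematic.
  pnorm(cutoff, mean = mean(zscores, na.rm = TRUE), sd = sd)
}

# A sample centred on 0 should give roughly pnorm(-2), i.e. ~2.3%, below -2
set.seed(1)
probit_prevalence(rnorm(1000))
```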

Let’s inspect our dataset:

# A tibble: 6 × 9
  district cluster  team sex     age weight height edema  muac
  <chr>      <int> <int> <chr> <dbl>  <dbl>  <dbl> <chr> <int>
1 Metuge         2     2 m      9.99   10.1   69.3 n       172
2 Metuge         2     2 f     43.6    10.9   91.5 n       130
3 Metuge         2     2 f     32.8    11.4   91.4 n       153
4 Metuge         2     2 f      7.62    8.3   69.5 n       133
5 Metuge         2     2 m     28.4    10.7   82.3 n       143
6 Metuge         2     2 f     12.3     6.6   69.4 n       121

Now let’s apply the prevalence function:

anthro.03 |>
  mw_wrangle_wfhz(
    sex = sex,
    .recode_sex = TRUE,
    height = height,
    weight = weight
  ) |>
  compute_wfhz_prevalence(
    .wt = NULL,
    .edema = edema,
    .summary_by = district
  )

The returned output will be:

# A tibble: 4 × 17
  district   gam_n  gam_p gam_p_low gam_p_upp gam_p_deff sam_n   sam_p sam_p_low
  <chr>      <dbl>  <dbl>     <dbl>     <dbl>      <dbl> <dbl>   <dbl>     <dbl>
1 Metuge        NA 0.0251   NA        NA              NA    NA 0.00155  NA
2 Cahora-Ba…    25 0.0738    0.0348    0.113         Inf     4 0.00336  -0.00348
3 Chiuta        11 0.0444    0.0129    0.0759        Inf     2 0.00444  -0.00466
4 Maravia       NA 0.0450   NA        NA              NA    NA 0.00351  NA
# ℹ 8 more variables: sam_p_upp <dbl>, sam_p_deff <dbl>, mam_n <dbl>,
#   mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>,
#   wt_pop <dbl>

Can you spot the differences? 😎 Yes, you’re absolutely correct! While in the Cahora-Bassa and Chiúta districts all columns are populated with numbers, in Metuge and Maravia only the gam_p, sam_p and mam_p columns are filled with numbers, and everything else is NA. These are the districts where the PROBIT method was applied, while in the Cahora-Bassa and Chiúta districts the standard complex sample analysis was done.

Estimation of the prevalence of wasting based on MFAZ

The prevalence of wasting based on MFAZ can be estimated using the compute_mfaz_prevalence() function. This function works and is implemented in the same way as demonstrated in Section 1.1, with the exception of the data wrangling, which is based on MUAC and was demonstrated in the plausibility checks. To avoid redundancy, we will not repeat the workflow here.

Estimation of the prevalence of wasting based on the raw MUAC values

This job is assigned to compute_muac_prevalence(). Once you call the function, before starting the prevalence estimation, it first evaluates the acceptability of the MFAZ standard deviation and the age ratio test. Yes, you read that right: the MFAZ standard deviation, not that of the raw MUAC values.

Important

Although the acceptability is evaluated on the basis of MFAZ, the actual prevalence is estimated on the basis of the raw MUAC values. MFAZ is also used to detect outliers and flag them to be excluded from the prevalence analysis.

The MFAZ standard deviation and the age ratio test results are used to control the prevalence analysis flow in this way:

  • If the MFAZ standard deviation and the age ratio test are both not problematic, a standard complex sample-based prevalence is estimated.
  • If the MFAZ standard deviation is not problematic but the age ratio test is problematic, the CDC/SMART MUAC tool age-weighting approach is applied.
  • If the MFAZ standard deviation is problematic, even if the age ratio test is not, no prevalence is estimated; NAs are returned instead.

When working with a multiple-area dataset, these conditionals will still be applied according to each area’s situation.
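The three conditionals above boil down to a simple decision rule. A plain-R sketch of the flow (choose_muac_method() is a hypothetical illustration, not a mwana function):

```r
# Hypothetical sketch of the decision flow; mwana applies this per area.
choose_muac_method <- function(sd_problematic, age_ratio_problematic) {
  if (sd_problematic) {
    "none: NA returned"                  # a problematic SD trumps everything
  } else if (age_ratio_problematic) {
    "CDC/SMART MUAC tool age-weighting"  # corrects for age bias
  } else {
    "standard complex sample-based"      # both tests acceptable
  }
}

choose_muac_method(sd_problematic = FALSE, age_ratio_problematic = TRUE)
#> [1] "CDC/SMART MUAC tool age-weighting"
```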

How does it work on a multi-area dataset?

Fundamentally, the function performs the standard deviation and age ratio tests, evaluates their acceptability, and returns a summarized table by area. It then iterates over that summary table row by row checking the above conditionals. Based on the conditionals of each row (area), the function accesses the original dataset, computes the prevalence accordingly, and returns the results.

To demonstrate this we will use the anthro.04 dataset.

As usual, let’s first inspect it:

# A tibble: 6 × 8
  province   cluster   sex   age  muac edema   mfaz flag_mfaz
  <chr>        <int> <dbl> <int> <dbl> <chr>  <dbl>     <dbl>
1 Province 3     743     2    21   130 n     -1.50          0
2 Province 3     743     2     9   126 n     -1.33          0
3 Province 3     743     2    12   128 n     -1.27          0
4 Province 3     743     2    34   145 n     -0.839         0
5 Province 3     743     2    11   130 n     -1.04          0
6 Province 3     743     2    33   140 n     -1.23          0

You see that this data has already been wrangled, so we will go straight to the prevalence estimation.

Important

As in the ENA Software, make sure you run the plausibility check before you call the prevalence function. This gives you a good sense of the acceptability of your data. If we do that with anthro.04, we will see which provinces have issues, and hence what to expect in the demonstrations below, based on the conditionals stated above.

anthro.04 |>
  compute_muac_prevalence(
    .wt = NULL,
    .edema = edema,
    .summary_by = province
  )

This will return:

# A tibble: 3 × 17
  province   gam_n  gam_p gam_p_low gam_p_upp gam_p_deff sam_n   sam_p sam_p_low
  <chr>      <dbl>  <dbl>     <dbl>     <dbl>      <dbl> <dbl>   <dbl>     <dbl>
1 Province 1   135  0.104    0.0778     0.130        Inf    19  0.0133   0.00682
2 Province 2    NA  0.112   NA         NA             NA    NA  0.0201  NA
3 Province 3    NA NA       NA         NA             NA    NA NA       NA
# ℹ 8 more variables: sam_p_upp <dbl>, sam_p_deff <dbl>, mam_n <dbl>,
#   mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>,
#   wt_pop <dbl>

We see that in Province 1, all columns are filled with numbers; in Province 2, some columns are filled with numbers while others are filled with NAs: this is where the age-weighting approach was applied. Lastly, in Province 3, NAs appear everywhere - you know why 😉.

Estimation of weighted prevalence

For this we go back to the anthro.02 dataset.

We approach this task as follows:

## Load library ----
library(dplyr)

## Compute prevalence ----
anthro.02 |>
  mw_wrangle_age(
    age = age,
    .decimals = 2
  ) |>
  mw_wrangle_muac(
    sex = sex,
    .recode_sex = FALSE,
    muac = muac,
    .recode_muac = TRUE,
    .to = "cm",
    age = "age"
  ) |>
  mutate(
    muac = recode_muac(muac, .to = "mm")
  ) |>
  compute_muac_prevalence(
    .wt = "wtfactor",
    .edema = edema,
    .summary_by = province
  )

This will return:

# A tibble: 2 × 17
  province gam_n  gam_p gam_p_low gam_p_upp gam_p_deff sam_n  sam_p sam_p_low
  <chr>    <dbl>  <dbl>     <dbl>     <dbl>      <dbl> <dbl>  <dbl>     <dbl>
1 Nampula     70 0.0571    0.0369    0.0773       2.00    28 0.0196   0.00706
2 Zambezia    65 0.0552    0.0380    0.0725       1.67    18 0.0133   0.00412
# ℹ 8 more variables: sam_p_upp <dbl>, sam_p_deff <dbl>, mam_n <dbl>,
#   mam_p <dbl>, mam_p_low <dbl>, mam_p_upp <dbl>, mam_p_deff <dbl>,
#   wt_pop <dbl>

Warning

You may have noticed that in the above code block, we called the recode_muac() function inside mutate(). This is because mw_wrangle_muac() returns the MUAC variable in centimeters, while compute_muac_prevalence() was defined to accept MUAC in millimeters. Therefore, it must be converted back to millimeters.
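The conversion itself is just a scale change (1 cm = 10 mm), which is what recode_muac(muac, .to = "mm") performs. A toy base-R equivalent, for intuition only:

```r
# Toy equivalent of converting MUAC from centimeters to millimeters
muac_cm <- c(12.5, 13.0, 14.9)
muac_mm <- muac_cm * 10
muac_mm
#> [1] 125 130 149
```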

Estimation of the combined prevalence of wasting

The estimation of the combined prevalence of wasting is a task attributed to the compute_combined_prevalence() function. The case-definition is based on WFHZ, the raw MUAC values, and edema. From a workflow standpoint, it combines the workflows demonstrated in Section 1.1 and Section 1.3.

To demonstrate its implementation we will use the anthro.01 dataset.

Let’s inspect our data:

# A tibble: 6 × 11
  area       dos        cluster  team sex   dob      age weight height edema
  <chr>      <date>       <int> <int> <chr> <date> <int>  <dbl>  <dbl> <chr>
1 District E 2023-12-04       1     3 m     NA        59   15.6  109.  n
2 District E 2023-12-04       1     3 m     NA         8    7.5   68.6 n
3 District E 2023-12-04       1     3 m     NA        19    9.7   79.5 n
4 District E 2023-12-04       1     3 f     NA        49   14.3  100.  n
5 District E 2023-12-04       1     3 f     NA        32   12.4   92.1 n
6 District E 2023-12-04       1     3 f     NA        17    9.3   77.8 n
# ℹ 1 more variable: muac <int>

Data wrangling

Fundamentally, it combines the wrangling workflow of WFHZ and MUAC data:

## Load library ----
library(dplyr)

## Apply the wrangling workflow ----
anthro.01 |>
  mw_wrangle_age(
    dos = dos,
    dob = dob,
    age = age,
    .decimals = 2
  ) |>
  mw_wrangle_muac(
    sex = sex,
    .recode_sex = TRUE,
    muac = muac,
    .recode_muac = TRUE,
    .to = "cm",
    age = "age"
  ) |>
  mutate(
    muac = recode_muac(muac, .to = "mm")
  ) |>
  mw_wrangle_wfhz(
    sex = sex,
    weight = weight,
    height = height,
    .recode_sex = FALSE
  )

This adds the wfhz, flag_wfhz, mfaz and flag_mfaz columns to the dataset. In the output below, we selected just these columns:

# A tibble: 1,191 × 5
   area         wfhz flag_wfhz   mfaz flag_mfaz
   <chr>       <dbl>     <dbl>  <dbl>     <dbl>
 1 District E -1.83          0 -1.45          0
 2 District E -0.956         0 -1.67          0
 3 District E -0.796         0 -0.617         0
 4 District E -0.74          0 -1.02          0
 5 District E -0.679         0 -0.93          0
 6 District E -0.432         0 -1.10          0
 7 District E -0.078         0 -0.255         0
 8 District E -0.212         0 -0.677         0
 9 District E -1.07          0 -2.18          0
10 District E -0.543         0 -0.403         0
# ℹ 1,181 more rows

Under the hood, compute_combined_prevalence() applies the same analysis approach as in compute_wfhz_prevalence() and in compute_muac_prevalence(). It checks the acceptability of the standard deviation of WFHZ and MFAZ and of the age ratio test. The following conditionals are checked and applied:

  • If the standard deviation of WFHZ, of MFAZ and the age ratio test are not problematic, the standard complex sample-based estimation is applied.
  • If either the standard deviation of WFHZ or of MFAZ, or the age ratio test, is problematic, the prevalence is not computed and NAs are returned.

In this function, a concept of “combined flags” is used.

What is a combined flag?

A combined flag marks as flagged any observation that is flagged in either the flag_wfhz or flag_mfaz vectors. A new column, cflags, is created and added to the dataset. This ensures that all flagged observations from both the WFHZ and MFAZ data are excluded from the prevalence analysis.

Table 1: Overview of the case-definition of combined flags

  flag_wfhz   flag_mfaz   cflags
          1           0        1
          0           1        1
          0           0        0
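This case-definition amounts to a logical OR of the two flag columns. A base-R sketch of how cflags could be derived (illustrative only; mwana computes this internally):

```r
# Toy data reproducing the combined-flag case-definition
flags <- data.frame(
  flag_wfhz = c(1, 0, 0),
  flag_mfaz = c(0, 1, 0)
)

# cflags = 1 if the observation is flagged in either column
flags$cflags <- as.numeric(flags$flag_wfhz == 1 | flags$flag_mfaz == 1)
flags
#>   flag_wfhz flag_mfaz cflags
#> 1         1         0      1
#> 2         0         1      1
#> 3         0         0      0
```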

Now that we understand what happens under the hood, we can proceed to the implementation:

## Load library ----
library(dplyr)

## Apply the workflow ----
anthro.01 |>
  mw_wrangle_age(
    dos = dos,
    dob = dob,
    age = age,
    .decimals = 2
  ) |>
  mw_wrangle_muac(
    sex = sex,
    .recode_sex = TRUE,
    muac = muac,
    .recode_muac = TRUE,
    .to = "cm",
    age = "age"
  ) |>
  mutate(
    muac = recode_muac(muac, .to = "mm")
  ) |>
  mw_wrangle_wfhz(
    sex = sex,
    weight = weight,
    height = height,
    .recode_sex = FALSE
  ) |>
  compute_combined_prevalence(
    .wt = NULL,
    .edema = edema,
    .summary_by = area
  )

We get this:

# A tibble: 2 × 17
  area       cgam_n  cgam_p cgam_p_low cgam_p_upp cgam_p_deff csam_n   csam_p
  <chr>       <dbl>   <dbl>      <dbl>      <dbl>       <dbl>  <dbl>    <dbl>
1 District E     NA NA         NA         NA               NA     NA NA
2 District G     55  0.0703     0.0447     0.0958         Inf     13  0.00747
# ℹ 9 more variables: csam_p_low <dbl>, csam_p_upp <dbl>, csam_p_deff <dbl>,
#   cmam_n <dbl>, cmam_p <dbl>, cmam_p_low <dbl>, cmam_p_upp <dbl>,
#   cmam_p_deff <dbl>, wt_pop <dbl>

In District E, NAs were returned because there were issues with the data. I leave it to you to figure out what the issue or issues were.

Tip

Consider running the plausibility checkers.