Data Generation

The meow framework is built around a central function, meow_sim().

Writing your own data loaders

Data loaders are where you encode your data generating process (DGP) for a simulation. We provide two very basic data loaders, one that loads pre-generated parameters and responses from files, and one that generates data according to a 1PL IRT model. Let’s start with the loader that reads existing data to take a look at the core features any data loader function needs to participate cooperatively in the meow ecosystem.

data_existing <- function(
  resp_path = 'data/sample-resp.csv',
  theta_path = 'data/true-sample-theta.rds',
  diff_path = 'data/true-sample-diff.rds'
) {
  out <- list(
    resp = utils::read.csv(resp_path),
    theta_tru = readRDS(theta_path),
    diff_tru = readRDS(diff_path)
  )
  return(out)
}

Function arguments

Interestingly enough, there are no required inputs for a data loader function, as it’s the first thing that gets called in an internal meow simulation. Your data loader can have, theoretically, whatever arguments you like with only a few exceptions.

The arguments theta, diff, resp, resp_cur, adj_mat, and select_seed should not be used, as these are standard arguments used in the item selection and parameter update functions.

Besides these, any argument names are generally fine and can be passed directly to your data loader through as named arguments to the meow_sim() function.

Return values

First, let’s look at the output. You can see that it’s a list with three named components: resp, theta_tru, and diff_tru.

The `$resp` object

A resp object is a dataframe of item response data in long form. It contains three columns, each containing numeric data, and the response matrix is aligned to the data standard in the Item Response Warehouse. The first, id, contains a 1-indexed integer respondent id for person $i$ . The second, item, contains a 1-indexed item identifier for item $j$ . The third, resp, contains the simulated response of person $i$ to item $j$ , $x_{ij}$ . For dichotomous responses, we use a binary response variable, $x_{ij} \in \{0,1\}$ .

The `$theta_tru` object

Next, we have the theta_tru object. Currently, this is implemented as a vector that contains the true abilities, where the ith element, $\theta_i$ , is the true ability of the $i^\text{th}$ respondent.

In the near future, this will be re-implemented as a dataframe to more easily allow for multidimensional models.

The `$diff_tru` object

Finally, we come to the diff_tru object. Currently, this is implemented as a vector that contains the true item difficulties, where the jth element, $b_j$ , is the true difficulty of the $j^\text{th}$ item.

In the near future, this will be re-implemented as a dataframe to more easily allow for IRT models beyond a 1PL.

Implementing a 1PL data loader

Now let’s look at our default 1PL based data loader. The code for it is shown below:

data_default <- function(N_persons = 100, N_items = 50, data_seed = 242424) {
  # note default behavior is fixed seed to ensure data consistency across runs
  set.seed(data_seed)

  theta_tru <- stats::rnorm(N_persons)
  diff_tru <- stats::rnorm(N_items)

  theta_mat <- matrix(
    theta_tru,
    nrow = N_persons,
    ncol = N_items,
    byrow = FALSE
  )
  diff_mat <- matrix(diff_tru, nrow = N_persons, ncol = N_items, byrow = TRUE)

  p <- stats::plogis(theta_mat - diff_mat)
  resp <- matrix(
    stats::rbinom(length(p), 1, p),
    nrow = N_persons,
    ncol = N_items
  ) |>
    as.data.frame() |>
    dplyr::mutate(id = 1:N_persons) |>
    tidyr::pivot_longer(
      tidyselect::starts_with('V'),
      names_to = 'item',
      values_to = 'resp',
      names_prefix = 'V'
    ) |>
    dplyr::select(.data$id, .data$item, .data$resp) |>
    dplyr::mutate(dplyr::across(tidyselect::everything(), as.numeric))

  out <- list(resp = resp, theta_tru = theta_tru, diff_tru = diff_tru)
  set.seed(NULL)
  return(out)
}

This function simulates data according to a very simple and inflexible DGP. It takes a number of respondents, N_persons and a number of items, N_items, and does the following:

Sets a seed according to the supplied data_seed.
Simulates person abilities, $\theta_i$ , for $i \in \{1, ..., N_\text{persons}\}$ where $\theta_i \sim \mathcal{N}(0,1)$
Simulates item difficulties, $b_j$ , for $j \in \{1, ..., N_\text{items}\}$ where $b_j \sim \mathcal{N}(0,1)$
Simulates dichotomous item responses, $x_{ij} \in \{0,1\}$ , according to a 1PL item response function: $P(x_{ij} = 1 | \theta_i) = \frac{1}{1+e^{-(\theta_i - b_j)}}$
Clears the seed by calling set.seed(NULL), so that the subsequent analysis is not determined by the data seed.
Packages these items up and returns them inside of a list.

Final important note on setting random seeds

If your data loader uses any random number generation and you allow it to take a random seed for reproducibility, make sure to clear the seed at the end of your data loader by calling set.seed(NULL). If you don’t, the current version of that seed will persist for all of the downstream simulation, and the full process will become deterministic. Specifically, if you want to compare multiple runs of an item selection and/or parameter update algorithm that uses some degree of randomization when applied to the same dataset, failing to clear the seed at the end of the data loader will make this impossible.