Missing Graphs

Missing graphs (m-graphs) are causal DAG models for missing data, in which a binary indicator node R_X determines whether node X is missing. The missingness mechanism (i.e. the causes of missing values) is then determined by the incoming edges of the indicator node. In PARCS, a simple function called m_graph_convert produces the m-graph data from a sampled dataset: it reads the indicator variables and masks the data based on the indicator realizations. In the example below, a value of A is kept where R_A is 1 and replaced by NaN where R_A is 0:

from pyparcs.helpers.missing_data import m_graph_convert
from pyparcs import Description, Graph
import numpy as np
np.random.seed(2022)

description = Description({'C': 'normal(mu_=0, sigma_=1)',
                           'A': 'normal(mu_=2C-1, sigma_=1)',
                           'R_A': 'bernoulli(p_=C+A-0.3AC), correction[target_mean=0.3]'},
                          infer_edges=True)

graph = Graph(description)
samples, _ = graph.sample(5)
print(samples)
#           C         A  R_A
# 0  0.774417  2.049457  0.0
# 1 -0.652315 -1.713998  0.0
# 2  1.310389  3.075017  1.0
# 3  0.240281 -0.888637  0.0
# 4 -0.884086 -2.497936  0.0
print(m_graph_convert(samples, missingness_prefix='R_', shared_subscript=False))
#           C         A
# 0  0.774417       NaN
# 1 -0.652315       NaN
# 2  1.310389  3.075017
# 3  0.240281       NaN
# 4 -0.884086       NaN

In order to use the function, you need to define the indicators of variables with a specific prefix, such as R_.
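To make the masking step concrete, here is a minimal pandas-only sketch of the idea. It is not the PARCS implementation: the helper name mask_by_indicators is hypothetical, and it assumes (as in the output above) that an indicator value of 1 keeps the entry and 0 turns it into NaN.

```python
import pandas as pd


def mask_by_indicators(df: pd.DataFrame, prefix: str = "R_") -> pd.DataFrame:
    """Hypothetical helper: mask each variable X using its indicator column <prefix>X.

    Assumption (matching the example output above): indicator == 1 keeps the
    value, indicator == 0 replaces it with NaN.
    """
    indicator_cols = [c for c in df.columns if c.startswith(prefix)]
    # Indicator columns are dropped from the result, only main variables remain
    out = df.drop(columns=indicator_cols).astype(float)
    for r_col in indicator_cols:
        target = r_col[len(prefix):]  # 'R_A' -> 'A'
        if target not in out.columns:
            raise KeyError(f"no variable named {target!r} for indicator {r_col!r}")
        # Series.where keeps entries where the condition holds, NaN elsewhere
        out[target] = out[target].where(df[r_col] == 1)
    return out


toy = pd.DataFrame({"C": [0.5, -1.0, 2.0],
                    "A": [1.5, 0.0, -3.0],
                    "R_A": [0.0, 1.0, 0.0]})
masked = mask_by_indicators(toy)
print(masked)
```

Only column A has an indicator in this toy frame, so C is returned untouched and the R_A column is removed from the result.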

Warning

Since the function m_graph_convert doesn’t have access to the description file, and only reads the sample data, it is up to the user to comply with the M-graph assumptions and restrictions, e.g. not having an edge from indicator nodes to main variables.
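One way to guard against this is to check the edge list yourself before sampling. The sketch below is a hypothetical stand-alone check, not part of PARCS: it assumes edges are available as (source, target) name pairs and that indicator nodes are recognizable by the prefix alone.

```python
def find_invalid_edges(edges, prefix="R_"):
    """Return edges that start at an indicator node and end at a main variable."""
    return [(src, dst) for src, dst in edges
            if src.startswith(prefix) and not dst.startswith(prefix)]


# Hypothetical edge list; the last edge violates the m-graph restriction
edges = [("C", "A"), ("C", "R_A"), ("A", "R_A"), ("R_A", "C")]
invalid = find_invalid_edges(edges)
print(invalid)  # [('R_A', 'C')]
```

Edges among indicators (R->R) pass the check, since they are allowed in an m-graph.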

Using M-graph Templates

Above, we showed how to create an m-graph by specifying a DAG as usual. Another option is to let PARCS randomly set up a missingness mechanism for the dataset and induce the missingness. Several works in the literature have proposed specific m-graph structures for missingness mechanisms. These structures determine which edges are allowed among Rs and from Z to Rs. Given two (sub)graphs Z and R, we can use the .randomize_connection_to() method to create outgoing edges from Z to R, while masking the edges according to one of the known m-graph structures. This way, R->R edges must be specified when creating R, and Z->R edges are induced by the randomizer. An example is provided below:

Ground-truth Graph

We start from a Z graph. It can be a graph of data nodes, or simulated data. In this example, we use the following graph description file:

# === Ground-truth graph over the Z nodes ===
# nodes
Z_1: normal(?)
Z_2: normal(?)
Z_3: normal(?)
Z_4: bernoulli(?), correction[target_mean=0.3]
# edges
Z_1->Z_2: identity(), correction[]
Z_1->Z_4: identity(), correction[]
Z_2->Z_3: identity(), correction[]
Z_3->Z_4: identity(), correction[]

Missingness Indicators Subgraph

Next, we determine the R subgraph. As mentioned, R->R edges must be defined here. We can again use a graph description file. But we can also use the indicator_outline helper which sets up an R subgraph for a given graph:

import numpy as np
from pprint import pprint
from pyparcs.helpers.missing_data import indicator_outline, sc_mask, m_graph_convert
np.random.seed(42)


outline_R = indicator_outline(adj_matrix=np.zeros(shape=(4, 4)),
                              node_names=[f'Z_{i}' for i in range(1, 5)],
                              miss_ratio=0.5,
                              prefix='R',
                              subscript_only=True)
pprint(outline_R)
# {'R_1': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_2': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_3': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_4': 'bernoulli(p_=?), correction[target_mean=0.5]'}

The parameters:

adj_matrix determines the edges among the R nodes. In this example, there are no edges among the Rs.
node_names is the list of names of the Z nodes.
miss_ratio is the missingness ratio; in the output above it appears as the target_mean of each indicator's correction.
prefix and subscript_only tell the function how to name the R nodes. If subscript_only=True, each name is the prefix followed by an underscore and the subscript of the corresponding Z node (e.g. R_1 for Z_1). If False, the names are R_<Z node name>.
file_dir is the directory of the created graph description file.
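The naming rule controlled by prefix and subscript_only can be sketched as follows. This is a hypothetical illustration rather than the code used by indicator_outline; in particular, it assumes that the "subscript" is whatever follows the last underscore in the Z node name.

```python
def indicator_name(z_name, prefix="R", subscript_only=True):
    """Hypothetical illustration of the naming rule described above."""
    if subscript_only:
        # Assumption: the subscript is the part after the last underscore,
        # e.g. 'Z_1' -> '1'
        subscript = z_name.rsplit("_", 1)[-1]
        return f"{prefix}_{subscript}"
    # Otherwise the full Z node name is appended to the prefix
    return f"{prefix}_{z_name}"


z_names = [f"Z_{i}" for i in range(1, 5)]
short_names = [indicator_name(z) for z in z_names]
long_names = [indicator_name(z, subscript_only=False) for z in z_names]
print(short_names)  # ['R_1', 'R_2', 'R_3', 'R_4']
print(long_names)   # ['R_Z_1', 'R_Z_2', 'R_Z_3', 'R_Z_4']
```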

For the R->R adjacency matrix, PARCS again provides helper functions to induce different missingness mechanisms; nevertheless, you can also construct the matrix manually. The helpers are:

R_adj_matrix(size, shuffle, density) freely randomizes edges among Rs while keeping the structure acyclic.
R_attrition_adj_matrix(size, step, density) gives an adjacency matrix which induces attrition missingness, i.e. edges from earlier Rs to later Rs. step determines how many later Rs an indicator can affect, e.g. if step=2 then R_1 will have an edge to R_2 and R_3 but no further.
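As an illustration of the attrition structure, the NumPy sketch below builds the matrix of allowed edges for a given step. It is a hypothetical stand-in, not the PARCS helper: it assumes rows are source indicators and columns are target indicators, and it ignores density by including every allowed edge.

```python
import numpy as np


def attrition_adjacency(size, step):
    """Hypothetical sketch: R_i -> R_j is allowed when 0 < j - i <= step."""
    adj = np.zeros((size, size), dtype=int)
    for i in range(size):
        # Row i is the source; mark the next `step` indicators as targets
        adj[i, i + 1:i + 1 + step] = 1
    return adj


adj = attrition_adjacency(size=4, step=2)
print(adj)
# [[0 1 1 0]
#  [0 0 1 1]
#  [0 0 0 1]
#  [0 0 0 0]]
```

Because the matrix is strictly upper triangular, the resulting R subgraph is acyclic by construction.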

Note

To make post-hoc changes, simply edit the resulting graph description file. For example, you can set the target_mean parameter in the correction of the R nodes to control the ratio of missingness.

Creating Z->R Edges

The next step is to use the .randomize_connection_to() method to create the Z->R edges. Here we apply masks to determine the missingness mechanism. These masks can be made manually, or taken from the helper module:

fully_observed_mar creates a p x q mask matrix where p and q are the numbers of Z and R nodes respectively, and p - q is the number of fully observed Z nodes (they have no R). With the parameter fully_observed_indices we specify the indices of the fully observed nodes.
nsc_mask allows for a no-self-censoring mechanism, i.e. all the Z->R edges are allowed except for the Z_i->R_i edges (the diagonal is zero).
sc_mask is an identity matrix, allowing only for self-censoring edges.
block_conditional_mask is an upper triangular mask where each Z can have edges only to later Rs (not previous ones). This mask assumes an order among the Z nodes (e.g. a chronological order).
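The square masks above are easy to reproduce by hand, which may help when building a custom one. The NumPy sketch below is an approximation under stated assumptions, not the PARCS helpers themselves: rows index Z nodes and columns index R nodes, and the block-conditional mask is taken to be strictly upper triangular (whether the helper keeps the diagonal is an assumption here). The candidate matrix is a made-up stand-in for randomly proposed Z->R edges.

```python
import numpy as np

size = 4

# Self-censoring: only Z_i -> R_i is allowed
self_censoring = np.eye(size, dtype=int)

# No self-censoring: everything except Z_i -> R_i is allowed
no_self_censoring = 1 - self_censoring

# Block-conditional: Z_i may only point to later R_j (j > i).
# Assumption: the diagonal is excluded (k=1).
block_conditional = np.triu(np.ones((size, size), dtype=int), k=1)

# Applying a mask: an element-wise product removes the forbidden edges
rng = np.random.default_rng(0)
candidate = rng.integers(0, 2, size=(size, size))  # hypothetical random Z->R edges
allowed = candidate * no_self_censoring

print(no_self_censoring)
print(allowed)
```

Whatever edges the randomizer proposes, the masked result can only contain edges that the chosen structure permits.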

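Finally, we sample from the constructed graph as follows. Here outline_Z.yml is the graph description file of the Z graph shown above, and simple_guideline.yml is a randomization guideline file (not shown here):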

import numpy as np
import pandas as pd
from pprint import pprint
from pyparcs import Description, Graph, Guideline
from pyparcs.helpers.missing_data import indicator_outline, sc_mask, m_graph_convert
np.random.seed(42)


outline_R = indicator_outline(adj_matrix=np.zeros(shape=(4, 4)),
                              node_names=[f'Z_{i}' for i in range(1, 5)],
                              miss_ratio=0.5,
                              prefix='R',
                              subscript_only=True)
pprint(outline_R)
# {'R_1': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_2': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_3': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_4': 'bernoulli(p_=?), correction[target_mean=0.5]'}

mask = pd.DataFrame(sc_mask(size=4),
                    index=[f'Z_{i}' for i in range(1, 5)],
                    columns=[f'R_{i}' for i in range(1, 5)])
guideline = Guideline('simple_guideline.yml')

description = Description('outline_Z.yml')
description.randomize_connection_to(outline_R, guideline,
                                    mask=mask)

graph = Graph(description)
samples, _ = graph.sample(1000)
masked_samples = m_graph_convert(samples, shared_subscript=True)

print(masked_samples.sample(4))
#           Z_1       Z_2  Z_4       Z_3
# 795  0.570035 -1.417940  1.0       NaN
# 354       NaN       NaN  NaN  0.724395
# 538       NaN       NaN  NaN       NaN
# 516       NaN -0.022907  NaN -1.825223
print(masked_samples.notna().mean())
# Z_3    0.503
# Z_4    0.493
# Z_1    0.473
# Z_2    0.482