Missing Graphs

Missing graphs (m-graphs) are causal DAG models for missing data, where the binary indicator node R_x determines if node X is missing. The missingness mechanism (i.e. what are the causes of missing values) are then determined by incoming edges to the indicator node. In PARCS, a simple function called m_graph_convert returns the corresponding m-graph for a graph object. It takes the indicator variables and mask the data based on the indicator realizations

 1from pyparcs.helpers.missing_data import m_graph_convert
 2from pyparcs import Description, Graph
 3import numpy as np
 4np.random.seed(2022)
 5
 6description = Description({'C': 'normal(mu_=0, sigma_=1)',
 7                           'A': 'normal(mu_=2C-1, sigma_=1)',
 8                           'R_A': 'bernoulli(p_=C+A-0.3AC), correction[target_mean=0.3]'},
 9                          infer_edges=True)
10
11graph = Graph(description)
12samples, _ = graph.sample(5)
13print(samples)
14#           C         A  R_A
15# 0  0.774417  2.049457  0.0
16# 1 -0.652315 -1.713998  0.0
17# 2  1.310389  3.075017  1.0
18# 3  0.240281 -0.888637  0.0
19# 4 -0.884086 -2.497936  0.0
20print(m_graph_convert(samples, missingness_prefix='R_', shared_subscript=False))
21#           C         A
22# 0  0.774417       NaN
23# 1 -0.652315       NaN
24# 2  1.310389  3.075017
25# 3  0.240281       NaN
26# 4 -0.884086       NaN

In order to use the function, you need to define the indicators of variables with a specific prefix, such as R_.

Warning

Since the function m_graph_convert doesn’t have access to the description file, and only reads the sample data, it is up to the user to comply with the M-graph assumptions and restrictions, e.g. not having an edge from indicator nodes to main variables.

Using M-graph Templates

Above, we showed how one can create an m-graph by determining a DAG as before. Another option is to allow PARCS to randomly set up a missingness mechanism for the dataset and induce the missingness. Many literatures have proposed unique m-graph structures for missingness mechanisms. These structures determine which edges are allowed among R`s and from `Z to R`s. By having two (sub) graphs of `Z and R, we can use the .randomize_connection_to() method, to create outgoing edges from Z to R, while masking the edges according to one of the known m-graph structures. This way, R->R edges must be induced when creating R, and Z->R will be induced by the randomizer. An example is provided below:

Ground-truth Graph

We start from a Z graph. It can be a graph of data nodes, or simulated data. In this example, we use the following graph description file:

 1# === A causal Triangle: Treatment, Outcome, Confounder ===
 2# nodes
 3Z_1: normal(?)
 4Z_2: normal(?)
 5Z_3: normal(?)
 6Z_4: bernoulli(?), correction[target_mean=0.3]
 7# edges
 8Z_1->Z_2: identity(), correction[]
 9Z_1->Z_4: identity(), correction[]
10Z_2->Z_3: identity(), correction[]
11Z_3->Z_4: identity(), correction[]

Missingness Indicators Subgraph

Next, we determine the R subgraph. As mentioned, R->R edges must be defined here. We can again use a graph description file. But we can also use the indicator_outline helper which sets up an R subgraph for a given graph:

 1from pyparcs.helpers.missing_data import indicator_outline, sc_mask, m_graph_convert
 2np.random.seed(42)
 3
 4
 5outline_R = indicator_outline(adj_matrix=np.zeros(shape=(4, 4)),
 6                              node_names=[f'Z_{i}' for i in range(1, 5)],
 7                              miss_ratio=0.5,
 8                              prefix='R',
 9                              subscript_only=True)
10pprint(outline_R)
11# {'R_1': 'bernoulli(p_=?), correction[target_mean=0.5]',
12#  'R_2': 'bernoulli(p_=?), correction[target_mean=0.5]',
13#  'R_3': 'bernoulli(p_=?), correction[target_mean=0.5]',
14#  'R_4': 'bernoulli(p_=?), correction[target_mean=0.5]'}

The parameters:

  • adj_matrix determines the edges among R nodes. In this example, there is no edge among Rs

  • node_name is the list of names for Z nodes

  • prefix and subscript_only tells the function how to make names for R nodes. If subscript_only=True then the node names are R_ and the subscript of Z nodes. If False, then the names will be R_<Z node name>

  • file_dir is the directory of the created graph description file.

For the R->R adjacency matrix, PARCS again provides helper functions to induce different missingness mechanisms; nevertheless, you can :

  • R_adj_matrix(size, shuffle, density) freely randomize edges among Rs to induce an acyclic structure.

  • R_attrition_adj_matrix(size, step, density) gives an adjacency matrix which induces attrition missingness, i.e. edges from early Rs to later Rs. step determines how many Rs can an indicator affect. e.g. if step=2 then R_1 will have an edge to R_2 and R_3 but not anymore.

Note

In order to make post-hoc changes, simply edit the resulting graph description file. For example you can set target_mean parameter to the correction of edges in order to control the ratio of missingness.

Creating Z->R Edges

Next step is to use the randomize connection method to create the Z->R edges. Here we apply masks in order to determine the missingness mechanism. These masks can be made manually, or read from the helper module as well:

  • fully_observed_mar creates an p x q mask matrix where p and q are length of Z and R node vectors respectively, and p - q determines the number of fully observed Z nodes (they have no R). With the parameter fully_observed_indices we specify the index of fully observed nodes.

  • nsc_mask allows for no-self censoring mechanism, i.e. all the Z->R edges are allowed except for Z_i->R_i edges (diagonal is zero)

  • sc_mask is an identity matrix, allowing only for self-censoring edges

  • block_conditional_mask is an upper triangular mask where each Z can have edges to only later Rs (not previous ones). This mask assumes an order in Z nodes (e.g. a chronological order).

Finally, we sample from the constructed graph as follows:

 1import numpy as np
 2import pandas as pd
 3from pprint import pprint
 4from pyparcs import Description, Graph, Guideline
 5from pyparcs.helpers.missing_data import indicator_outline, sc_mask, m_graph_convert
 6np.random.seed(42)
 7
 8
 9outline_R = indicator_outline(adj_matrix=np.zeros(shape=(4, 4)),
10                              node_names=[f'Z_{i}' for i in range(1, 5)],
11                              miss_ratio=0.5,
12                              prefix='R',
13                              subscript_only=True)
14pprint(outline_R)
15# {'R_1': 'bernoulli(p_=?), correction[target_mean=0.5]',
16#  'R_2': 'bernoulli(p_=?), correction[target_mean=0.5]',
17#  'R_3': 'bernoulli(p_=?), correction[target_mean=0.5]',
18#  'R_4': 'bernoulli(p_=?), correction[target_mean=0.5]'}
19
20mask = pd.DataFrame(sc_mask(size=4),
21                    index=[f'Z_{i}' for i in range(1, 5)],
22                    columns=[f'R_{i}' for i in range(1, 5)])
23guideline = Guideline('simple_guideline.yml')
24
25description = Description('outline_Z.yml')
26description.randomize_connection_to(outline_R, guideline,
27                                    mask=mask)
28
29graph = Graph(description)
30samples, _ = graph.sample(1000)
31masked_samples = m_graph_convert(samples, shared_subscript=True)
32
33print(masked_samples.sample(4))
34#           Z_1       Z_2  Z_4       Z_3
35# 795  0.570035 -1.417940  1.0       NaN
36# 354       NaN       NaN  NaN  0.724395
37# 538       NaN       NaN  NaN       NaN
38# 516       NaN -0.022907  NaN -1.825223
39print(masked_samples.notna().mean())
40# Z_3    0.503
41# Z_4    0.493
42# Z_1    0.473
43# Z_2    0.482