Missing Graphs
Missing graphs (m-graphs) are causal DAG models for missing data, in which a binary indicator node R_X determines whether node X is missing. The missingness mechanism (i.e. the causes of the missing values) is then determined by the incoming edges of the indicator node. In PARCS, a simple function called m_graph_convert returns the data of the corresponding m-graph for sampled data: it reads the indicator variables and masks the data based on the indicator realizations.
from pyparcs.helpers.missing_data import m_graph_convert
from pyparcs import Description, Graph
import numpy as np
np.random.seed(2022)

description = Description({'C': 'normal(mu_=0, sigma_=1)',
                           'A': 'normal(mu_=2C-1, sigma_=1)',
                           'R_A': 'bernoulli(p_=C+A-0.3AC), correction[target_mean=0.3]'},
                          infer_edges=True)

graph = Graph(description)
samples, _ = graph.sample(5)
print(samples)
#           C         A  R_A
# 0  0.774417  2.049457  0.0
# 1 -0.652315 -1.713998  0.0
# 2  1.310389  3.075017  1.0
# 3  0.240281 -0.888637  0.0
# 4 -0.884086 -2.497936  0.0
print(m_graph_convert(samples, missingness_prefix='R_', shared_subscript=False))
#           C         A
# 0  0.774417       NaN
# 1 -0.652315       NaN
# 2  1.310389  3.075017
# 3  0.240281       NaN
# 4 -0.884086       NaN
To use the function, you need to define the missingness indicators with a specific prefix, such as R_. With shared_subscript=False (as above), an indicator masks the variable whose full name follows the prefix, so R_A masks A; with shared_subscript=True, indicators and data columns are matched via their shared subscript, so R_1 masks Z_1 (as in the templates example below).
Warning
Since the function m_graph_convert doesn't have access to the description file and only reads the sample data, it is up to the user to comply with the m-graph assumptions and restrictions, e.g. not having an edge from indicator nodes to main variables.
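For instance, the hypothetical description below violates that restriction: because R_A appears in the parameters of A, infer_edges=True creates an R_A -> A edge, yet m_graph_convert would still silently mask the data.

from pyparcs import Description

# Hypothetical counter-example: A's mean depends on its own indicator R_A,
# so infer_edges=True adds an R_A -> A edge, which is not a valid m-graph edge.
# m_graph_convert does not detect this; avoiding it is the user's responsibility.
bad_description = Description({'C': 'normal(mu_=0, sigma_=1)',
                               'R_A': 'bernoulli(p_=0.3)',
                               'A': 'normal(mu_=2C+R_A, sigma_=1)'},
                              infer_edges=True)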
Using M-graph Templates
Above, we showed how to create an m-graph by specifying the DAG explicitly. Another option is to let PARCS randomly set up a missingness mechanism for the dataset and induce the missingness. The literature proposes various m-graph structures for missingness mechanisms; these structures determine which edges are allowed among the Rs and from the Zs to the Rs. Given two (sub)graphs for Z and R, we can use the .randomize_connection_to() method to create outgoing edges from Z to R, while masking the edges according to one of the known m-graph structures. This way, R->R edges must be induced when creating the R subgraph, and Z->R edges are induced by the randomizer. An example is provided below:
Ground-truth Graph
We start from a Z graph. It can be a graph of data nodes, or simulated data. In this example, we use the following graph description file:
# === Ground-truth graph (outline_Z.yml) ===
# nodes
Z_1: normal(?)
Z_2: normal(?)
Z_3: normal(?)
Z_4: bernoulli(?), correction[target_mean=0.3]
# edges
Z_1->Z_2: identity(), correction[]
Z_1->Z_4: identity(), correction[]
Z_2->Z_3: identity(), correction[]
Z_3->Z_4: identity(), correction[]
Missingness Indicators Subgraph
Next, we determine the R subgraph. As mentioned, R->R edges must be defined here. We can again use a graph description file, but we can also use the indicator_outline helper, which sets up an R subgraph outline for a given graph:
import numpy as np
from pprint import pprint
from pyparcs.helpers.missing_data import indicator_outline

np.random.seed(42)

# no R->R edges in this example: the adjacency matrix is all zeros
outline_R = indicator_outline(adj_matrix=np.zeros(shape=(4, 4)),
                              node_names=[f'Z_{i}' for i in range(1, 5)],
                              miss_ratio=0.5,
                              prefix='R',
                              subscript_only=True)
pprint(outline_R)
# {'R_1': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_2': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_3': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_4': 'bernoulli(p_=?), correction[target_mean=0.5]'}
The parameters:

- adj_matrix determines the edges among the R nodes. In this example, there is no edge among the Rs.
- node_names is the list of names of the Z nodes.
- miss_ratio sets the target_mean of the indicators' correction (here 0.5).
- prefix and subscript_only tell the function how to name the R nodes. If subscript_only=True, the node names are the prefix plus the subscript of the corresponding Z node (here R_1, ..., R_4); if False, the names will be R_<Z node name> (see the sketch below).
- file_dir is the directory of the created graph description file.
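As a quick illustration of the naming rule, the sketch below repeats the call with subscript_only=False; based on the description above, the keys should then follow the R_<Z node name> pattern (the expected output is an assumption, not captured from a run).

import numpy as np
from pprint import pprint
from pyparcs.helpers.missing_data import indicator_outline

# Same call as above, but with subscript_only=False: indicator names are built
# from the full Z node names rather than from their subscripts.
outline_R_full = indicator_outline(adj_matrix=np.zeros(shape=(4, 4)),
                                   node_names=[f'Z_{i}' for i in range(1, 5)],
                                   miss_ratio=0.5,
                                   prefix='R',
                                   subscript_only=False)
pprint(outline_R_full)
# expected keys (assumption): 'R_Z_1', 'R_Z_2', 'R_Z_3', 'R_Z_4'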
For the R->R adjacency matrix, PARCS again provides helper functions to induce different missingness mechanisms (although you can always define the matrix manually, as in the example above); a usage sketch follows the list:

- R_adj_matrix(size, shuffle, density) freely randomizes edges among the Rs while keeping the structure acyclic.
- R_attrition_adj_matrix(size, step, density) gives an adjacency matrix which induces attrition missingness, i.e. edges from earlier Rs to later Rs. The step parameter determines how many subsequent Rs an indicator can affect; e.g. if step=2, then R_1 has edges to R_2 and R_3 but no further.
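A minimal sketch of plugging one of these helpers into indicator_outline; it assumes the adjacency helpers are importable from pyparcs.helpers.missing_data alongside indicator_outline, which is not shown in the snippets above.

import numpy as np
from pprint import pprint
# Assumption: the adjacency helpers live in the same module as indicator_outline.
from pyparcs.helpers.missing_data import indicator_outline, R_attrition_adj_matrix

np.random.seed(42)

# Attrition-style R->R structure: each indicator can affect at most the next
# two indicators (step=2); density presumably controls the fraction of
# allowed edges that are actually drawn.
adj_R = R_attrition_adj_matrix(size=4, step=2, density=0.5)

outline_R = indicator_outline(adj_matrix=adj_R,
                              node_names=[f'Z_{i}' for i in range(1, 5)],
                              miss_ratio=0.5,
                              prefix='R',
                              subscript_only=True)
pprint(outline_R)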
Note
To make post-hoc changes, simply edit the resulting graph description file. For example, you can set the target_mean parameter in the correction of the edges in order to control the ratio of missingness.
Creating Z->R Edges
The next step is to use the .randomize_connection_to() method to create the Z->R edges. Here we apply masks to determine the missingness mechanism. These masks can be made manually (see the sketch after the list below) or taken from the helper module:
- fully_observed_mar creates a p x q mask matrix, where p and q are the lengths of the Z and R node vectors respectively; p - q is the number of fully observed Z nodes (nodes without an R). With the parameter fully_observed_indices we specify the indices of the fully observed nodes.
- nsc_mask allows for the no-self-censoring mechanism, i.e. all Z->R edges are allowed except for the Z_i->R_i edges (the diagonal is zero).
- sc_mask is an identity matrix, allowing only self-censoring edges.
- block_conditional_mask is an upper-triangular mask where each Z can have edges only to later Rs (not previous ones). This mask assumes an order on the Z nodes (e.g. a chronological order).
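Because a mask is simply a 0/1 matrix with Z nodes as rows and R nodes as columns (1 meaning the edge is allowed), it can also be built by hand. The sketch below constructs a custom upper-triangular mask the same way sc_mask is wrapped in a DataFrame in the full example that follows.

import numpy as np
import pandas as pd

# Manual mask: rows are Z nodes, columns are R nodes; 1 allows a Z->R edge.
# Here each Z may only affect its own and later indicators (upper triangular).
manual_mask = pd.DataFrame(np.triu(np.ones((4, 4))),
                           index=[f'Z_{i}' for i in range(1, 5)],
                           columns=[f'R_{i}' for i in range(1, 5)])
print(manual_mask)
#      R_1  R_2  R_3  R_4
# Z_1  1.0  1.0  1.0  1.0
# Z_2  0.0  1.0  1.0  1.0
# Z_3  0.0  0.0  1.0  1.0
# Z_4  0.0  0.0  0.0  1.0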
Finally, we sample from the constructed graph as follows:
import numpy as np
import pandas as pd
from pprint import pprint
from pyparcs import Description, Graph, Guideline
from pyparcs.helpers.missing_data import indicator_outline, sc_mask, m_graph_convert
np.random.seed(42)

# outline of the R subgraph: four indicators, no R->R edges, 50% missingness each
outline_R = indicator_outline(adj_matrix=np.zeros(shape=(4, 4)),
                              node_names=[f'Z_{i}' for i in range(1, 5)],
                              miss_ratio=0.5,
                              prefix='R',
                              subscript_only=True)
pprint(outline_R)
# {'R_1': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_2': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_3': 'bernoulli(p_=?), correction[target_mean=0.5]',
#  'R_4': 'bernoulli(p_=?), correction[target_mean=0.5]'}

# self-censoring mask: only Z_i -> R_i edges are allowed
mask = pd.DataFrame(sc_mask(size=4),
                    index=[f'Z_{i}' for i in range(1, 5)],
                    columns=[f'R_{i}' for i in range(1, 5)])
guideline = Guideline('simple_guideline.yml')

# randomly create Z->R edges, restricted by the mask
description = Description('outline_Z.yml')
description.randomize_connection_to(outline_R, guideline,
                                    mask=mask)

graph = Graph(description)
samples, _ = graph.sample(1000)
masked_samples = m_graph_convert(samples, shared_subscript=True)

print(masked_samples.sample(4))
#           Z_1       Z_2  Z_4       Z_3
# 795  0.570035 -1.417940  1.0       NaN
# 354       NaN       NaN  NaN  0.724395
# 538       NaN       NaN  NaN       NaN
# 516       NaN -0.022907  NaN -1.825223
print(masked_samples.notna().mean())
# Z_3    0.503
# Z_4    0.493
# Z_1    0.473
# Z_2    0.482
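Since the unmasked samples are still at hand, we can run an informal check of the self-censoring mechanism: if the randomizer actually drew the Z_1->R_1 edge, the distribution of Z_1 should differ between observed (R_1 = 1) and missing (R_1 = 0) rows.

# Informal check: compare Z_1 among rows where it ends up observed vs. missing,
# using the unmasked samples (the indicator columns are still present there).
print(samples.groupby('R_1')['Z_1'].mean())
# a noticeable gap between the two means hints at self-censoring;
# exact values depend on the randomized edge coefficients.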