Creating the First Graph

The following PARCS code example creates a simulation graph and samples from it:

A PARCS code example
 1  from pyparcs import Description, Graph
 2  import numpy
 3  numpy.random.seed(42)
 4
 5  outline = {'A': 'normal(mu_=0, sigma_=1)',
 6             'B': 'bernoulli(p_=0.4)',
 7             'C': 'normal(mu_=2A+B^2-1, sigma_=1)',
 8             'A->C': 'identity()',
 9             'B->C': 'identity()'}
10
11  description = Description(outline, infer_edges=False)
12  graph = Graph(description)
13  samples, _ = graph.sample(5)
14
15  # samples:
16  #           A    B         C
17  # 0 -0.319852  1.0 -0.020850
18  # 1  0.249876  0.0 -1.511305
19  # 2 -1.571066  1.0 -2.885899
20  # 3  0.547763  0.0  1.974996
21  # 4  0.963863  0.0  0.019293

Let’s walk through the code step by step.

Graph Outline

As PARCS simulates data based on a causal DAG, the first step in simulation is to describe a graph. For this, PARCS provides an easy and readable way.

Outline
outline = {'A': 'normal(mu_=0, sigma_=1)',
           'B': 'bernoulli(p_=0.4)',
           'C': 'normal(mu_=2A+B^2-1, sigma_=1)',
           'A->C': 'identity()',
           'B->C': 'identity()'}

Line 5 of the code defines an outline dictionary that encodes the structural equations of the graph. The first three entries describe three nodes: \(A, B, C\). Node A follows a standard normal distribution, and B follows a Bernoulli distribution with success probability \(p(B=1) = 0.4\). The outline describes node C using another normal distribution with unit standard deviation, whose mean is a function of A and B. In other words, we have \(C = 2A + B^2 - 1 + \epsilon\), where \(\epsilon \sim \mathcal{N}(0, 1)\).

So, what are the rules and conventions of writing a graph outline?

  1. Each key in the outline dictionary describes a node or an edge. Edge names are marked by the arrow symbol ->. If an edge X_i -> X_j is described, then the nodes X_i and X_j must also be described in the outline. Node names follow a specific set of rules as well: for instance, they must start with a letter, and the only allowed special character is the underscore.

  2. Nodes are described by a sampling distribution. In this case, the nodes are known as stochastic in PARCS. Other types of nodes are available too, such as deterministic and data nodes; we will explore them in later sections.

  3. The equation describing a distribution parameter of a child node can consist of a bias term, plus linear, interaction, and quadratic terms of the parent variables. Thus, the valid terms for a node with three parents A, B, and C are: A, B, C, A^2, AB, AC, B^2, BC, C^2, and a bias.

  4. Edges are described by a so-called edge function. These functions are transformations applied to the parent samples before they are passed to the child node. In our example, the functions are identity, so no transformation is applied. If a non-identity transformation \(E(X)\) is applied, then the input to the child's parameters is \(E(X)\) rather than \(X\), even though we still write X in the child node's entry of the outline.

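To make rules 3 and 4 concrete, here is a small numpy sketch of what the example outline encodes for node C. This is a hand-written illustration of the structural equation, not PARCS internals; the helper name mu_c is ours.

```python
import numpy as np

rng = np.random.default_rng(42)

def mu_c(a, b):
    # mu_ = 2A + B^2 - 1: a bias (-1), a linear term (2A),
    # and a quadratic term (B^2)
    return 2 * a + b ** 2 - 1

# parent samples; the identity edges pass them through unchanged
a = rng.normal(0, 1, size=5)
b = rng.binomial(1, 0.4, size=5).astype(float)

# C is normal with unit sigma_ and a mean given by the equation
c = rng.normal(mu_c(a, b), 1)
```

With a non-identity edge function \(E\) on A->C, the call would effectively become mu_c(E(a), b): the equation in the outline is still written in terms of A, but receives the transformed samples.
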
An outline can also be written in a YAML file. For instance, equivalent to the dictionary above, we can have the following file:

outline.yml
# nodes
A: normal(mu_=0, sigma_=1)
B: bernoulli(p_=0.4)
C: normal(mu_=2A+B^2-1, sigma_=1)
# edges
A->C: identity()
B->C: identity()

If the outline is written in a YAML file, the Description object in line 11 receives the path to the file instead, as in Description('./path/to/outline.yml'). Writing the outline in a separate YAML file gives us better code separation and a cleaner workspace.

Descriptions

In the next step, the outline is passed to the Description class.

Description
description = Description(outline, infer_edges=False)

As we will see in later sections, this intermediate step between the outline and the main PARCS graph lets us do useful preprocessing. For now, let’s study how this object works.

First, the infer_edges=False argument in line 11: you may have noticed that the outline is somewhat redundant in how it describes the edges. Since the node names A and B appear in the equation for C, PARCS could have inferred the edges A->C and B->C. Nevertheless, if you do not include the edges in the outline, you get an error:

PARCS error due to incomplete outline
from pyparcs import Description, Graph
import numpy
numpy.random.seed(42)

outline = {'A': 'normal(mu_=0, sigma_=1)',
           'B': 'bernoulli(p_=0.4)',
           'C': 'normal(mu_=2A+B^2-1, sigma_=1)'}

description = Description(outline, infer_edges=False)

# pyparcs.core.exceptions.DescriptionError: The term 2A in the description file is invalid.
#     Another possibility is that a node name exists in another node's parameter,
#     while it is not marked as a parent by the described edges; in this case, check for the
#     nodes/edges consistency.

So what is the rationale for this behavior? Edges are described by their edge functions, so PARCS needs to know which function you have chosen for an edge before it can sample from the graph. However, since identity() is a basic function that will probably be selected most of the time, you can set infer_edges=True and skip the edges in the outline; Description then assumes an identity edge for every parent-child pair:

implicit edges are inferred
from pyparcs import Description, Graph
import numpy
numpy.random.seed(42)

outline = {'A': 'normal(mu_=0, sigma_=1)',
           'B': 'bernoulli(p_=0.4)',
           'C': 'normal(mu_=2A+B^2-1, sigma_=1)'}

description = Description(outline, infer_edges=True)
graph = Graph(description)
samples, _ = graph.sample(5)

# samples:
#           A    B         C
# 0 -0.319852  1.0 -0.020850
# 1  0.249876  0.0 -1.511305
# 2 -1.571066  1.0 -2.885899
# 3  0.547763  0.0  1.974996
# 4  0.963863  0.0  0.019293

If you want to set other edge functions, you must describe them explicitly in the outline.

Now, let’s briefly check the main attributes of the description object. First, it has nodes and edges attributes, which hold the parsed nodes and edges of the graph:

Description nodes and edges
from pprint import pprint

pprint(description.nodes)
# {'A': {'correction_config': {},
#        'dist_params_coefs': {'mu_': {'bias': 0.0,
#                                      'interactions': [],
#                                      'linear': []},
#                              'sigma_': {'bias': 1.0,
#                                         'interactions': [],
#                                         'linear': []}},
#        'do_correction': False,
#        'output_distribution': 'normal'},
#  'B': {'correction_config': {},
#        'dist_params_coefs': {'p_': {'bias': 0.4,
#                                     'interactions': [],
#                                     'linear': []}},
#        'do_correction': False,
#        'output_distribution': 'bernoulli'},
#  'C': {'correction_config': {},
#        'dist_params_coefs': {'mu_': {'bias': -1.0,
#                                      'interactions': [0, 0, 1],
#                                      'linear': [2.0, 0]},
#                              'sigma_': {'bias': 1.0,
#                                         'interactions': [0, 0, 0],
#                                         'linear': [0, 0]}},
#        'do_correction': False,
#        'output_distribution': 'normal'}}

pprint(description.edges)
# {'A->C': {'do_correction': False,
#           'function_name': 'identity',
#           'function_params': {}},
#  'B->C': {'do_correction': False,
#           'function_name': 'identity',
#           'function_params': {}}}

Each entry in these two attributes describes a node or an edge. As a start, you can check the output_distribution and function_name keys. Next, you can see how the node equations are reflected in the dist_params_coefs key. These coefficient vectors follow the order of the parents list, which is another attribute of the description object.

Description parents list
print(description.parents_list)
# {'A': [], 'B': [], 'C': ['A', 'B']}
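
As a sanity check, we can reconstruct the mu_ equation of C by hand from these vectors. The term ordering assumed below, linear terms as [A, B] and interaction/quadratic terms as [A^2, AB, B^2] following the parents list, is our reading of the printed output; mu_c is a hypothetical helper, not a PARCS function.

```python
import numpy as np

# coefficients of C's mu_, copied from description.nodes
coefs = {'bias': -1.0, 'linear': [2.0, 0], 'interactions': [0, 0, 1]}

def mu_c(a, b):
    linear = np.dot(coefs['linear'], [a, b])                     # 2A + 0B
    quad = np.dot(coefs['interactions'], [a * a, a * b, b * b])  # B^2
    return coefs['bias'] + linear + quad

print(mu_c(1.0, 1.0))  # 2A + B^2 - 1 at A=1, B=1 gives 2.0
```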

The good news is that, as a general user of PARCS, you almost never need to dig into these attributes; the interplay between the objects handles graph definition and data generation for you. This was only a sneak peek behind the scenes.

Instantiate a Graph

Finally, the description object is passed to the Graph class to instantiate a graph object, from which you can sample the generated data. More specifically, the .sample() method provides the samples.

Graph
description = Description(outline, infer_edges=False)
graph = Graph(description)
samples, _ = graph.sample(5)

You have probably noticed that the sampling method returns two values, one of which we ignored in the code example via the underscore. PARCS samples the graph by drawing an error vector from the continuous uniform distribution \(\text{unif}(0, 1)\) for each node and passing it to the inverse CDF of the node's distribution. In this approach, all the stochasticity of a node is captured by the error values, as the inverse CDF of a given error value is deterministic. This means that reproducibility is guaranteed by controlling the error vectors.

Going back to the sampling method, the second output is the sampled error vectors.

returning the error vectors
from pyparcs import Description, Graph
import numpy
numpy.random.seed(42)

description = Description('./outline.yml')
graph = Graph(description)
samples, errors = graph.sample(2)
print(samples)
#           A    B         C
# 0 -0.319852  1.0 -0.020850
# 1  0.249876  0.0 -1.511305

print(errors)
#           A         B         C
# 0  0.374540  0.950714  0.731994
# 1  0.598658  0.156019  0.155995

As a quick sanity check, you can compare the error vector for B, [0.95, 0.16], with the samples for B, [1.0, 0.0]. For a Bernoulli distribution with success probability \(p\), the inverse CDF maps an error to \(1\) if it is bigger than \(1-p\), and to \(0\) otherwise.
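
This check can be reproduced with the standard library alone. The snippet below applies the inverse CDFs by hand to the error values printed above; it mirrors the mechanism described in this section, not PARCS's actual implementation.

```python
from statistics import NormalDist

# errors for A, pushed through the standard normal inverse CDF;
# the results are close to the samples -0.319852 and 0.249876
errors_a = [0.374540, 0.598658]
samples_a = [NormalDist(0, 1).inv_cdf(u) for u in errors_a]

# errors for B: the Bernoulli(p) inverse CDF returns 1 exactly when
# the error exceeds 1 - p
p = 0.4
errors_b = [0.950714, 0.156019]
samples_b = [1.0 if u > 1 - p else 0.0 for u in errors_b]
print(samples_b)  # [1.0, 0.0]
```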

The sampling method allows us to reproduce the results by re-using the sampled errors:

reusing sampled errors
from pyparcs import Description, Graph
import numpy
numpy.random.seed(42)

description = Description('./outline.yml')
graph = Graph(description)
samples, errors = graph.sample(2)
print(samples)
#           A    B         C
# 0 -0.319852  1.0 -0.020850
# 1  0.249876  0.0 -1.511305

new_samples, _ = graph.sample(use_sampled_errors=True, sampled_errors=errors)
print(new_samples)
#           A    B         C
# 0 -0.319852  1.0 -0.020850
# 1  0.249876  0.0 -1.511305

Note that when using the sampled errors, we do not set the size argument; the graph object samples according to the length of the error vectors. This was a trivial use of the sampled errors. The scenario becomes more interesting when you want to compare the data generated by a graph before and after a change. In the example below, we cut an edge and compare one specific data point:

reusing sampled errors after changing the graph
from pyparcs import Description, Graph
import numpy
numpy.random.seed(42)

description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'exponential(lambda_=A^2+1)'},
                          infer_edges=True)
graph = Graph(description)
samples, errors = graph.sample(1)
print(samples)
#           A         B
# 0 -0.319852  4.112427
print(errors)
#          A         B
# 0  0.37454  0.950714

new_description = Description({'A': 'normal(mu_=0, sigma_=1)',
                               'B': 'exponential(lambda_=1)'})
new_graph = Graph(new_description)
new_samples, new_errors = new_graph.sample(use_sampled_errors=True,
                                           sampled_errors=errors)
print(new_samples)
#           A         B
# 0 -0.319852  4.010121
print(new_errors)
#          A         B
# 0  0.37454  0.950714

See also

Read the API doc for the pyparcs.Graph class.