Nodes and Edges
This section explains the details of nodes and edges, their different types, and other information that can be useful when simulating data using PARCS.
Node Types
Stochastic Nodes
In the previous section, we saw how a node is described by its sampling distribution. This type of node is called a stochastic node in PARCS. For a distribution with multiple parameters, the general description of these nodes follows this pattern:
N: outputDistribution(param1_=..., param2_=..., ...)
As explained in the theoretical background, parameters accept bias, linear, interaction, and quadratic terms of the parents. For example, if a node has three parents A, B, and C, the possible terms are:
A, B, C, A^2, AB, AC, B^2, BC, C^2
and hence, a possible combination looks something like A^2 + 2A + BC - 1.
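As a minimal sketch (node names and the exact combination of terms are illustrative, but they follow the term syntax above), such an expression can appear directly as a parameter of a node description:
from pyparcs import Description, Graph

# Illustrative only: C's mean combines bias (-1), linear (2A),
# interaction (AB), and quadratic (A^2) terms of its parents
description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'normal(mu_=0, sigma_=1)',
                           'C': 'normal(mu_=A^2+2A+AB-1, sigma_=1)',
                           'A->C': 'identity()', 'B->C': 'identity()'})
samples, _ = Graph(description).sample(5)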
Stochastic nodes can be marked with a feature called Correction. To explain this feature, look at the graph below:
from pyparcs import Description, Graph

description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'bernoulli(p_=0.3)'})
graph = Graph(description)
samples, _ = graph.sample(5)
print(samples)
#           A    B
# 0 -1.527236  0.0
# 1  0.385980  1.0
# 2 -0.217019  0.0
# 3 -0.137873  0.0
# 4  0.462656  1.0
This is an ordinary simulation graph. Now, what happens if we describe B as a child of A?
from pyparcs import Description, Graph

description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'bernoulli(p_=2A)',
                           'A->B': 'identity()'})
graph = Graph(description)
samples, _ = graph.sample(5)
print(samples)
# pyparcs.core.exceptions.DistributionError: Bern(p) probabilities are out of [0, 1] range
Of course, the range of 2A is bigger than [0, 1]; therefore, the Bernoulli distribution receives invalid success probabilities for some of the samples. For distribution parameters whose valid range is smaller than the real line, PARCS provides a simple sigmoid transformation that maps the real line to a custom range. For the Bernoulli parameter, a standard sigmoid transformation does the job, as it maps the values to [0, 1]. For other parameters, such as the normal's standard deviation or the exponential's rate, we can obtain wider ranges by customizing the sigmoid function. This feature is activated by putting a correction[] after the distribution:
description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'bernoulli(p_=2A), correction[]',
                           'A->B': 'identity()'})
graph = Graph(description)
samples, _ = graph.sample(5)
print(samples)
#           A    B
# 0 -0.135696  0.0
# 1  0.085163  0.0
# 2  0.428591  1.0
# 3 -0.586112  0.0
# 4  1.222686  0.0
The correction function has three parameters: lower, upper, and target_mean. By default, lower and upper are 0 and 1. If we want to impose a certain mean on the corrected parameter, we can also set the target_mean parameter. This parameter is useful, for example, when we want a simulated Bernoulli node to have a certain expected value.
description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'B': 'bernoulli(p_=2A), correction[lower=0, upper=1, target_mean=0.3]',
                           'A->B': 'identity()'})
graph = Graph(description)
samples, _ = graph.sample(1000)

print(samples['B'].mean())
# 0.319
Note that when using correction, the argument given to the corrected parameter is wrapped in a sigmoid function, and the target mean is achieved by updating the bias value. It is therefore best to read the correction formula carefully before including it in your simulations. Overall, this feature can be useful, as researchers often use this trick to sample from distributions with restricted parameters.
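As a conceptual sketch of the transformation (not the exact PARCS implementation; in particular, the bias update for target_mean is omitted here), the correction maps an unrestricted value into [lower, upper] via a scaled logistic function:
import numpy as np

def corrected_param(raw, lower=0.0, upper=1.0):
    """Map an unrestricted raw value into [lower, upper] with a sigmoid."""
    return lower + (upper - lower) / (1.0 + np.exp(-raw))

# e.g. raw = 2A for the Bernoulli success probability
corrected_param(np.array([-3.0, 0.0, 3.0]))
# array([0.04742587, 0.5       , 0.95257413])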
Deterministic Nodes
How do we describe a simulation variable such as \(Y = a_0 + a_1X_1 + a_2\sin{(X_2)} - a_3\log(X_3^3)\)? Clearly, we cannot use stochastic nodes: apart from the fact that only bias, linear, interaction, and quadratic terms are allowed, the described variable is not a random variable but a deterministic one. You can use deterministic(_, _) nodes to implement such variables in the graph. The first argument points to a Python script consisting of custom Python functions, and the second argument specifies the name of the function:
from pyparcs import Description, Graph
import numpy as np
np.random.seed(42)

description = Description({'A': 'bernoulli(p_=0.3)',
                           'B': 'normal(mu_=0, sigma_=1)',
                           'C': 'deterministic(./custom_functions.py, log_sin)',
                           'A->C': 'identity()', 'B->C': 'identity()'})
graph = Graph(description)
samples, errors = graph.sample(2)
print(samples)
#      A         B         C
# 0  0.0  1.651819  2.204312
# 1  0.0 -1.010956  3.559511

print(errors)
#           A         B         C
# 0  0.374540  0.950714  0.731994
# 1  0.598658  0.156019  0.155995
and the custom_functions.py file is:
import numpy as np


def log_sin(data):
    """some custom functions"""
    return 1 + 2 * np.log(data['A'] + 1) + 3 * np.sin(data['B']**2)
Some remarks:
1. The parents of a deterministic node are specified only by the edges in the outline and by the columns that the custom function reads from the data object.
2. Because of (1), the infer_edges option does not work for deterministic nodes.
3. The data object that the function receives is a pandas DataFrame. Note that the edge functions are already applied to the data before it is passed to the node.
4. Even though it is not needed, the error vector is sampled for the deterministic node too. Apart from consistency, this will help us later if we want to compare two graphs over a unique data point when a deterministic node is replaced by a stochastic one.
Warning
PARCS identifies the custom function by appending the given file path to the Python path. Therefore, be careful not to have a name conflict with other functions in your Python path. A good precautionary measure is to always use a unique prefix for the functions, such as parcs_custom_...
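For example (an illustrative rename of the earlier file, following this advice):
import numpy as np

def parcs_custom_log_sin(data):
    """Same function as before, prefixed to avoid clashes on the Python path."""
    return 1 + 2 * np.log(data['A'] + 1) + 3 * np.sin(data['B']**2)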
What if we wanted to create a stochastic node (e.g., a normally distributed variable) but with a custom equation for its parameters? Currently, this is not immediately possible; but as a quick workaround, you can first create a dummy deterministic node and pass it to the stochastic node:
from pyparcs import Description, Graph
import numpy as np
np.random.seed(42)

description = Description({'A': 'bernoulli(p_=0.3)',
                           'B': 'normal(mu_=0, sigma_=1)',
                           'C_dummy': 'deterministic(./custom_functions.py, log_sin)',
                           'C': 'normal(mu_=C_dummy, sigma_=1)',
                           'A->C_dummy': 'identity()', 'B->C_dummy': 'identity()',
                           'C_dummy->C': 'identity()'})
graph = Graph(description)
samples, _ = graph.sample(2)
print(samples)
#      A         B   C_dummy         C
# 0  0.0  1.651819  2.204312  2.823166
# 1  0.0 -1.011057  3.559830  1.988763
Later, we introduce a small feature to better represent dummy nodes and filter them out of the final samples data frame.
Data Nodes
In many simulation studies, it is desirable to augment a real dataset with some simulated nodes; an example is simulating a missing data scenario, where a missingness mask is simulated for a given dataset. Data nodes allow us to read an external CSV file to sample a node. The arguments of a data node are the CSV file path and the name of the variable in the CSV file. Consider the following dataset.csv:
,A,B
0,10,100
1,11,101
2,12,102
from pyparcs import Description, Graph
import numpy as np
np.random.seed(42)

description = Description({'A': 'data(./dataset.csv, A)',
                           'B': 'data(./dataset.csv, B)',
                           'C': 'normal(mu_=A+B, sigma_=1)',
                           'A->C': 'identity()',
                           'B->C': 'identity()'})
graph = Graph(description)
samples, errors = graph.sample(5)

print(samples)
#     A    B           C
# 0  11  101  112.618855
# 1  11  101  110.988943
# 2  10  100  110.256234
# 3  12  102  115.879470
# 4  12  102  113.091568

print(errors)
#           A         B         C
# 0  0.374540  0.374540  0.731994
# 1  0.598658  0.598658  0.155995
# 2  0.058084  0.058084  0.601115
# 3  0.708073  0.708073  0.969910
# 4  0.832443  0.832443  0.181825
Some remarks:
The error vectors for A and B are identical. This is because we are sampling from the joint \((A, B)\) distribution. In other words, whole rows of the CSV file are being sampled, rather than individual elements.
The length of the dataset is 3, but in the example, we sampled 5 data points. Because of that, we have duplicate samples for \((A, B)\). The interpretation of this behavior is that we are sampling from a distribution which is determined by the dataset. Just like a parametric distribution such as Bernoulli, realizations may repeat.
We can sample the entire dataset once, using full_data=True in the sample method. This way, the size of the sample is determined by the length of the dataset, and every data point is sampled exactly once:
samples, errors = graph.sample(full_data=True)
print(samples)
#     A    B           C
# 0  10  100  110.618855
# 1  11  101  110.988943
# 2  12  102  114.256234

print(errors)
#      A    B         C
# 0  0.0  0.0  0.731994
# 1  0.5  0.5  0.155995
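Conceptually (a sketch of the row-sampling behavior, not PARCS internals; the file name follows the example above), a data node resamples whole rows of the CSV, which is why the parent columns share one error value per sampled row:
import numpy as np
import pandas as pd

df = pd.read_csv('./dataset.csv', index_col=0)
errors = np.random.uniform(size=5)                 # one error value per sampled row
rows = (errors * len(df)).astype(int)              # map each error to a row index
resampled = df.iloc[rows].reset_index(drop=True)   # A and B stay jointly sampled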
Note
Data nodes must have no incoming edges in the outline.
Constant Nodes
This class of nodes provides a simple way of adding a constant-value node to the graph. The syntax in the description file is as simple as creating a node by constant(<value>):
description = Description({'A': 'constant(2)',
                           'B': 'constant(3)'})
graph = Graph(description)
samples, _ = graph.sample(3)
print(samples)
#      A    B
# 0  2.0  3.0
# 1  2.0  3.0
# 2  2.0  3.0
Note
We can mimic deterministic and constant nodes by hacking a stochastic normal node, with mu_ being the desired constant or deterministic term and sigma_ being 0. Using the deterministic and constant nodes, however, adds to the clarity of the simulation and the description file.
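For illustration (a minimal sketch of this hack; the node names are hypothetical):
from pyparcs import Description

description = Description({'A': 'normal(mu_=0, sigma_=1)',
                           'CONST': 'normal(mu_=2, sigma_=0)',   # always 2
                           'DET': 'normal(mu_=2A+1, sigma_=0)',  # always 2A+1
                           'A->DET': 'identity()'})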
Edge Correction
Similar to the concept of distribution parameter correction for stochastic nodes, edges can also be marked for correction. For edges, this amounts to normalizing the input values using the mean and standard deviation of the parent node's distribution.
description = Description({'A': 'normal(mu_=100, sigma_=10)',
                           'B': 'normal(mu_=A, sigma_=1)',
                           'A->B': 'identity()'})
graph = Graph(description)
samples, errors = graph.sample(3)
#             A           B
# 0   96.801476   98.453296
# 1  106.188546  106.438423
# 2   89.890436   88.879378

description = Description({'A': 'normal(mu_=100, sigma_=10)',
                           'B': 'normal(mu_=A, sigma_=1)',
                           'A->B': 'identity(), correction[]'})
graph = Graph(description)
samples, _ = graph.sample(use_sampled_errors=True,
                          sampled_errors=errors)
#             A         B
# 0   96.801476  1.445580
# 1  106.188546  0.967103
# 2   89.890436 -1.897181
Edge correction for other edge functions has the same effect. The difference is that, after normalization, the input values go through the transformation defined by the edge function.
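As a conceptual sketch (not PARCS internals), edge correction standardizes the parent's samples using the parent distribution's moments and only then applies the edge function:
import numpy as np

def corrected_edge(parent_samples, parent_mean, parent_std,
                   edge_function=lambda z: z):
    """Standardize with the parent distribution's mean/std, then transform."""
    z = (parent_samples - parent_mean) / parent_std
    return edge_function(z)

# e.g. A ~ normal(100, 10) from the example above, with the identity edge
corrected_edge(np.array([96.801476, 106.188546, 89.890436]), 100, 10)
# array([-0.3198524,  0.6188546, -1.0109564])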