celer.datasets.make_correlated_data

celer.datasets.make_correlated_data(n_samples=100, n_features=50, corr=0.6, snr=3, density=0.2, w_true=None, random_state=None)

Generate a correlated design matrix with decaying correlation rho**|i-j|, together with observations generated according to

\[y = X w^* + \epsilon\]

such that \(\|X w^*\| / \|\epsilon\| = \text{snr}\).

The generated features have mean 0, variance 1 and the expected correlation structure:

\[\mathbb E[x_i] = 0~, \quad \mathbb E[x_i^2] = 1 \quad \text{and} \quad \mathbb E[x_ix_j] = \rho^{|i-j|}\]
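
One way to obtain features with these moments is a first-order autoregressive recursion across columns. The sketch below uses plain NumPy for illustration and is not necessarily how celer builds the matrix; the helper name `ar1_features` and its `seed` argument are hypothetical.

```python
import numpy as np


def ar1_features(n_samples, n_features, rho, seed=0):
    """Hypothetical helper: x_{j+1} = rho * x_j + sqrt(1 - rho^2) * eps_j."""
    rng = np.random.RandomState(seed)
    sigma = np.sqrt(1.0 - rho ** 2)  # innovation scale that keeps unit variance
    X = np.empty((n_samples, n_features))
    X[:, 0] = rng.randn(n_samples)
    for j in range(1, n_features):
        X[:, j] = rho * X[:, j - 1] + sigma * rng.randn(n_samples)
    return X


X = ar1_features(n_samples=50_000, n_features=5, rho=0.6)
# Empirical second moments; they approach rho ** |i - j| as n_samples grows.
emp_corr = X.T @ X / X.shape[0]
```

With many samples, `emp_corr[i, j]` should be close to `0.6 ** abs(i - j)`, and the diagonal close to 1.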
Parameters:
n_samples: int

Number of samples in the design matrix.

n_features: int

Number of features in the design matrix.

corr: float

Correlation \(\rho\) between successive features. The element \(C_{i, j}\) in the correlation matrix will be \(\rho^{|i-j|}\). This parameter should be selected in \([0, 1[\).
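
To make the structure of this matrix concrete, the following sketch builds \(C\) explicitly for a small case. It is only an illustration in NumPy; the variable names are arbitrary. At \(\rho = 1\) every entry would equal 1 and the matrix would be singular, which is consistent with the upper bound being excluded.

```python
import numpy as np

rho, n_features = 0.6, 4
idx = np.arange(n_features)
# C[i, j] = rho ** |i - j|: constant along each diagonal (Toeplitz).
C = rho ** np.abs(idx[:, None] - idx[None, :])
# Eigenvalues of the symmetric matrix C; all are positive for rho in [0, 1).
eigvals = np.linalg.eigvalsh(C)
```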

snr: float or np.inf

Signal-to-noise ratio. If np.inf, no noise is added.
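
As an illustration of the ratio \(\|X w^*\| / \|\epsilon\|\), the sketch below rescales a Gaussian noise vector so that the ratio equals a chosen value exactly. This is an assumed construction for illustration, not a statement about celer's internals; `X`, `w` and the seed are placeholders.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 50)  # placeholder design matrix
w = rng.randn(50)       # placeholder coefficients
snr = 3.0

signal = X @ w
noise = rng.randn(100)
# Rescale the noise so that ||signal|| / ||noise|| equals snr.
noise *= np.linalg.norm(signal) / (snr * np.linalg.norm(noise))
y = signal + noise

achieved = np.linalg.norm(signal) / np.linalg.norm(noise)
```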

density: float

Proportion of nonzero elements in w_true when it is simulated, that is, when w_true is None.

w_true: np.array, shape (n_features,) | None

True regression coefficients. If None, an array with nnz nonzero standard Gaussian entries is simulated, where nnz is determined by density.
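
A sparse coefficient vector of this kind could be simulated as sketched below. The rounding rule `int(density * n_features)` and the uniformly random choice of support are assumptions made for this example.

```python
import numpy as np

rng = np.random.RandomState(0)
n_features, density = 50, 0.2
nnz = int(density * n_features)  # assumption: count is rounded down

w_true = np.zeros(n_features)
# Pick nnz distinct positions and fill them with standard Gaussian values.
support = rng.choice(n_features, size=nnz, replace=False)
w_true[support] = rng.randn(nnz)
```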

random_state: int | RandomState instance | None (default)

Controls the random number generator used to generate the data. Pass an int for reproducible output.

Returns:
X: ndarray, shape (n_samples, n_features)

A design matrix with Toeplitz covariance.

y: ndarray, shape (n_samples,)

Observation vector.

w_true: ndarray, shape (n_features,)

True regression vector of the model.