Estimate Gaussian/RBF kernel parameter beta from covariates (supports landmarks)

Computes a data-driven reference scale for the Gaussian/RBF kernel from covariates using a robust "median nearest-neighbor (or nearest-landmark) distance" heuristic, and returns the corresponding kernel parameter $\beta$.

The Gaussian/RBF kernel is assumed to be written in the form $$k(u,v) = \exp\{-\beta \|u-v\|^2\} = \exp\{-\|u-v\|^2/(2\sigma^2)\},$$ hence $\beta = 1/(2\sigma^2)$. This function first estimates a typical distance scale $\sigma_0$ as the median of nearest-neighbor (or nearest-landmark) Euclidean distances, then sets $\beta_0 = 1/(2\sigma_0^2)$.
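The equivalence of the two parameterizations can be checked numerically. The following is a minimal base R sketch; `rbf_beta` and `rbf_sigma` are hypothetical helper names used only for illustration and are not part of the package.

```r
# Hypothetical helpers (not part of the package): the same Gaussian/RBF kernel
# written once in the beta form and once in the sigma form.
rbf_beta  <- function(u, v, beta)  exp(-beta * sum((u - v)^2))
rbf_sigma <- function(u, v, sigma) exp(-sum((u - v)^2) / (2 * sigma^2))

u <- c(0, 0)
v <- c(1, 2)
sigma <- 1.5                  # arbitrary bandwidth chosen for the example
beta  <- 1 / (2 * sigma^2)    # beta = 1 / (2 sigma^2)

k_beta  <- rbf_beta(u, v, beta)
k_sigma <- rbf_sigma(u, v, sigma)
all.equal(k_beta, k_sigma)    # the two forms give the same kernel value
```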

If Uk is NULL, $\sigma_0$ is estimated as the median of nearest-neighbor distances within U (excluding self-distance). If Uk is provided, $\sigma_0$ is estimated as the median of nearest-landmark distances from each sample in U to its closest landmark in Uk.
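The two cases can be sketched in base R as follows. This is an unblocked illustration of the heuristic described above, not the package's implementation; the helper `sq_dist_cols` and the simulated matrices are assumptions made for the example.

```r
# Hypothetical helper (illustration only): squared Euclidean distances between
# the columns of A (K x n) and the columns of B (K x m), returned as n x m.
sq_dist_cols <- function(A, B) {
  d2 <- outer(colSums(A^2), colSums(B^2), "+") - 2 * crossprod(A, B)
  pmax(d2, 0)  # clip tiny negative values caused by rounding
}

set.seed(1)
U  <- matrix(rnorm(2 * 50), nrow = 2)   # K = 2, N = 50 samples
Uk <- matrix(rnorm(2 * 5),  nrow = 2)   # K = 2, M = 5 landmarks

# Case Uk = NULL: nearest neighbor within U, self-distance excluded
D2 <- sq_dist_cols(U, U)
diag(D2) <- Inf
sigma0_nn <- median(sqrt(apply(D2, 1, min)))

# Case Uk provided: distance from each sample to its closest landmark
sigma0_lm <- median(sqrt(apply(sq_dist_cols(U, Uk), 1, min)))

beta0_nn <- 1 / (2 * sigma0_nn^2)
beta0_lm <- 1 / (2 * sigma0_lm^2)
```

Setting the diagonal to `Inf` is one simple way to exclude self-distances; without it the nearest-neighbor distance of every sample would be 0.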

To control memory usage for large N (and M), distances are computed in blocks. Optionally, columns of U can be randomly subsampled via sample_size to reduce cost.
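Blockwise computation can be sketched as below. This is a hedged illustration of the general idea (a running minimum over column blocks), assuming Euclidean distances; `nearest_dist_blocked` is a hypothetical name and the package's internal code may differ.

```r
# Hypothetical helper (illustration only): nearest-landmark distance for each
# column of U, computed block by block so that at most
# block_size x block_size_Uk squared distances are held in memory at once.
nearest_dist_blocked <- function(U, Uk, block_size = 1000, block_size_Uk = 2000) {
  N <- ncol(U)
  M <- ncol(Uk)
  best <- rep(Inf, N)  # running minimum of squared distances per sample
  for (i0 in seq(1, N, by = block_size)) {
    i  <- i0:min(i0 + block_size - 1, N)
    Ui <- U[, i, drop = FALSE]
    for (j0 in seq(1, M, by = block_size_Uk)) {
      j  <- j0:min(j0 + block_size_Uk - 1, M)
      Uj <- Uk[, j, drop = FALSE]
      d2 <- outer(colSums(Ui^2), colSums(Uj^2), "+") - 2 * crossprod(Ui, Uj)
      best[i] <- pmin(best[i], apply(d2, 1, min))
    }
  }
  sqrt(pmax(best, 0))  # clip rounding noise before taking the square root
}

set.seed(1)
U  <- matrix(rnorm(3 * 250), nrow = 3)   # K = 3, N = 250
Uk <- matrix(rnorm(3 * 40),  nrow = 3)   # K = 3, M = 40

d_blocked <- nearest_dist_blocked(U, Uk, block_size = 100, block_size_Uk = 16)
d_full    <- nearest_dist_blocked(U, Uk, block_size = 250, block_size_Uk = 40)
sigma0    <- median(d_blocked)
```

The block sizes change only the peak memory footprint, so `d_blocked` and `d_full` should agree up to floating-point rounding.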

nmfkc.kernel.beta.nearest.med(
  U,
  Uk = NULL,
  block_size = 1000,
  block_size_Uk = 2000,
  sample_size = NULL
)

Arguments

U: A numeric matrix of covariates (K x N); columns are samples.
Uk: An optional numeric matrix of landmarks (K x M); columns are landmark points. If provided, distances are computed from samples in U to landmarks in Uk.
block_size: Integer. Number of columns of U processed per block when computing distances (controls memory usage). If N <= 1000, it is automatically set to N.
block_size_Uk: Integer. Number of columns of Uk processed per block when Uk is not NULL (controls memory usage). If M <= 2000, it is automatically set to M.
sample_size: Integer or NULL. If not NULL, randomly subsamples this many columns of U (without replacement) before computing distances, to reduce computational cost.

Value

A list with elements:

beta: Estimated kernel parameter $\beta_0 = 1/(2\sigma_0^2)$.
beta_candidates: Numeric vector of candidate $\beta$ values (logarithmic grid) intended for cross-validation.
dist_median: The estimated distance scale $\sigma_0$ (median of nearest-neighbor or nearest-landmark distances).
block_size_used: The effective block size(s) used. Either a scalar (no Uk) or a named vector c(U=..., Uk=...) when Uk is provided.
sample_size_used: The number of columns of U actually used (after subsampling).
uk_is_u: Logical flag indicating whether Uk was detected as identical to U (only returned when Uk is provided).

Details

Candidate grid: Along with beta, the function returns beta_candidates, a small logarithmic grid suitable for cross-validation.

In the landmark case (Uk provided), the grid is designed to be symmetric on the log scale of the bandwidth $\sigma$ around $\sigma_0$, extending one decade in each direction: $$\sigma = \sigma_0 \times 10^{t}, \quad t \in \{-1,-2/3,-1/3,0,1/3,2/3,1\}.$$ Using $\beta = 1/(2\sigma^2)$, this corresponds to $$\beta = \beta_0 \times 10^{-2t},$$ so the candidate $\beta$ values range from $\beta_0/100$ to $100\,\beta_0$.
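The grid formula can be reproduced directly in base R. The value of `sigma0` below is arbitrary, and the ordering of the values returned in beta_candidates by the function is not specified here, so this sketch only illustrates the formula.

```r
sigma0 <- 0.8                     # example value, chosen arbitrarily
beta0  <- 1 / (2 * sigma0^2)

t_grid     <- (-3:3) / 3          # -1, -2/3, -1/3, 0, 1/3, 2/3, 1
sigma_grid <- sigma0 * 10^t_grid
beta_grid  <- beta0 * 10^(-2 * t_grid)

# The beta grid matches the sigma grid under beta = 1 / (2 sigma^2)
all.equal(beta_grid, 1 / (2 * sigma_grid^2))
```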

When Uk is NULL, a shorter coarse grid may be returned (see Value).

Notes:

When Uk is identical to U, the function detects this case and excludes self-distances (distance 0) to avoid $\sigma_0=0$.
sample_size performs random subsampling without setting a seed. For reproducible results, call set.seed() before calling this function.
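The effect of fixing the seed can be seen with a small base R sketch. It uses `sample.int()` to draw column indices without replacement; whether the function uses this exact call internally is an assumption, and the toy matrix is for illustration only.

```r
U <- matrix(seq_len(2 * 20), nrow = 2)   # toy covariates, K = 2, N = 20
sample_size <- 5

# Drawing the same subsample twice by resetting the seed before each draw
set.seed(123)
idx_a <- sample.int(ncol(U), sample_size)   # without replacement by default
set.seed(123)
idx_b <- sample.int(ncol(U), sample_size)

identical(idx_a, idx_b)                     # same seed, same columns
U_sub <- U[, idx_a, drop = FALSE]           # K x sample_size submatrix
```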

Examples

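# Toy data for illustration: K = 2 covariates, N = 100 samples, M = 10 landmarks
set.seed(1)
U <- matrix(rnorm(2 * 100), nrow = 2)
Uk <- matrix(rnorm(2 * 10), nrow = 2)

# Basic (nearest-neighbor within U)
beta_info <- nmfkc.kernel.beta.nearest.med(U)
beta0 <- beta_info$beta
betas <- beta_info$beta_candidates

# With landmarks (nearest-landmark distances)
beta_info <- nmfkc.kernel.beta.nearest.med(U, Uk)
beta0 <- beta_info$beta
betas <- beta_info$beta_candidates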