Computes a data-driven reference scale for the Gaussian/RBF kernel from covariates using a robust "median nearest-neighbor (or nearest-landmark) distance" heuristic, and returns the corresponding kernel parameter \(\beta\).

The Gaussian/RBF kernel is assumed to be written in the form $$k(u,v) = \exp\{-\beta \|u-v\|^2\} = \exp\{-\|u-v\|^2/(2\sigma^2)\},$$ hence \(\beta = 1/(2\sigma^2)\). This function first estimates a typical distance scale \(\sigma_0\) as the median of nearest-neighbor (or nearest-landmark) distances, then sets \(\beta_0 = 1/(2\sigma_0^2)\).
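The relation between the two parameterizations can be checked numerically. The following is a minimal sketch in base R; the helper names sigma_to_beta and rbf_kernel are hypothetical and are not part of the package.

```r
# Hypothetical helper: convert a bandwidth sigma to beta = 1 / (2 * sigma^2).
sigma_to_beta <- function(sigma) 1 / (2 * sigma^2)

# Hypothetical helper: Gaussian/RBF kernel value in the beta parameterization.
rbf_kernel <- function(u, v, beta) exp(-beta * sum((u - v)^2))

sigma0 <- 0.5
beta0 <- sigma_to_beta(sigma0)        # 1 / (2 * 0.25) = 2

u <- c(0, 0)
v <- c(1, 0)

# Both parameterizations give the same kernel value, exp(-2).
k_beta  <- rbf_kernel(u, v, beta0)
k_sigma <- exp(-sum((u - v)^2) / (2 * sigma0^2))
```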

If Uk is NULL, \(\sigma_0\) is estimated as the median of nearest-neighbor distances within U (excluding self-distance). If Uk is provided, \(\sigma_0\) is estimated as the median of nearest-landmark distances from each sample in U to its closest landmark in Uk.
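The two cases above can be illustrated with a small base R sketch. It is not the package's implementation: sq_dist and median_nearest_dist are hypothetical helpers, and the whole distance matrix is formed at once (no blocking), so it only suits small inputs.

```r
# Hypothetical helper: squared Euclidean distances between the columns of
# A (K x n) and the columns of B (K x m), returned as an n x m matrix.
sq_dist <- function(A, B) {
  d2 <- outer(colSums(A^2), colSums(B^2), "+") - 2 * crossprod(A, B)
  pmax(d2, 0)  # clip tiny negative values caused by rounding
}

# Hypothetical helper: median nearest-neighbor (Uk = NULL) or
# nearest-landmark (Uk given) distance.
median_nearest_dist <- function(U, Uk = NULL) {
  if (is.null(Uk)) {
    d2 <- sq_dist(U, U)
    diag(d2) <- Inf          # exclude self-distances
  } else {
    d2 <- sq_dist(U, Uk)
  }
  median(sqrt(apply(d2, 1, min)))
}

# Four samples on a line at x = 0, 1, 3, 7 (K = 2, N = 4).
U  <- rbind(c(0, 1, 3, 7), c(0, 0, 0, 0))
# Two landmarks at x = 0 and x = 4 (K = 2, M = 2).
Uk <- rbind(c(0, 4), c(0, 0))

sigma_nn <- median_nearest_dist(U)       # nearest-neighbor distances 1, 1, 2, 4 -> 1.5
sigma_lm <- median_nearest_dist(U, Uk)   # nearest-landmark distances 0, 1, 1, 3 -> 1
beta_nn  <- 1 / (2 * sigma_nn^2)
beta_lm  <- 1 / (2 * sigma_lm^2)         # 0.5
```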

To control memory usage for large N (and M), distances are computed in blocks. Optionally, columns of U can be randomly subsampled via sample_size to reduce cost.
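As an illustration of the blocking idea (a sketch under assumptions, not the package's internal code), the hypothetical function nearest_dist_blocked below processes block_size columns of U at a time, so that only a block_size x N slice of the distance matrix is held in memory at once.

```r
# Hypothetical helper: nearest-neighbor distance of every column of U,
# computed block by block to bound memory usage.
nearest_dist_blocked <- function(U, block_size = 1000) {
  N <- ncol(U)
  norms <- colSums(U^2)
  nearest <- numeric(N)
  for (start in seq(1, N, by = block_size)) {
    idx <- start:min(start + block_size - 1, N)
    # Squared distances from the current block to all columns of U.
    d2 <- outer(norms[idx], norms, "+") -
      2 * crossprod(U[, idx, drop = FALSE], U)
    d2 <- pmax(d2, 0)                       # guard against rounding
    d2[cbind(seq_along(idx), idx)] <- Inf   # mask self-distances
    nearest[idx] <- sqrt(apply(d2, 1, min))
  }
  nearest
}

set.seed(1)
U <- matrix(rnorm(3 * 25), nrow = 3)   # K = 3, N = 25

# The result should not depend on the block size.
d_small_blocks <- nearest_dist_blocked(U, block_size = 7)
d_one_block    <- nearest_dist_blocked(U, block_size = 25)
sigma0 <- median(d_small_blocks)
```

A smaller block_size lowers peak memory at the cost of more loop iterations; the estimate itself is unchanged.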

nmfkc.kernel.beta.nearest.med(
  U,
  Uk = NULL,
  block_size = 1000,
  block_size_Uk = 2000,
  sample_size = NULL
)

Arguments

U

A numeric matrix of covariates (K x N); columns are samples.

Uk

An optional numeric matrix of landmarks (K x M); columns are landmark points. If provided, distances are computed from samples in U to landmarks in Uk.

block_size

Integer. Number of columns of U processed per block when computing distances (controls memory usage). If N <= 1000, it is automatically set to N.

block_size_Uk

Integer. Number of columns of Uk processed per block when Uk is not NULL (controls memory usage). If M <= 2000, it is automatically set to M.

sample_size

Integer or NULL. If not NULL, randomly subsamples this many columns of U (without replacement) before computing distances, to reduce computational cost.

Value

A list with elements:

  • beta: Estimated kernel parameter \(\beta_0 = 1/(2\sigma_0^2)\).

  • beta_candidates: Numeric vector of candidate \(\beta\) values (logarithmic grid) intended for cross-validation.

  • dist_median: The estimated distance scale \(\sigma_0\) (median of nearest-neighbor or nearest-landmark distances).

  • block_size_used: The effective block size(s) used. Either a scalar (no Uk) or a named vector c(U=..., Uk=...) when Uk is provided.

  • sample_size_used: The number of columns of U actually used (after subsampling).

  • uk_is_u: Logical flag indicating whether Uk was detected as identical to U (only returned when Uk is provided).

Details

Candidate grid: Along with beta, the function returns beta_candidates, a small logarithmic grid suitable for cross-validation.

In the landmark case (Uk provided), the grid is symmetric on the logarithmic scale of the bandwidth \(\sigma\) around \(\sigma_0\), spanning one decade in each direction: $$\sigma = \sigma_0 \times 10^{t}, \quad t \in \{-1,-2/3,-1/3,0,1/3,2/3,1\}.$$ Using \(\beta = 1/(2\sigma^2)\), this corresponds to $$\beta = \beta_0 \times 10^{-2t},$$ so the candidate \(\beta\) values range from \(\beta_0/100\) to \(100\,\beta_0\).
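The grid formula can be reproduced directly in base R. This sketch only evaluates the expressions above for an assumed value sigma0 = 2; the ordering of beta_candidates actually returned by the function may differ.

```r
sigma0 <- 2                          # assumed example value
beta0  <- 1 / (2 * sigma0^2)         # 0.125

t_grid     <- (-3:3) / 3             # -1, -2/3, -1/3, 0, 1/3, 2/3, 1
sigma_grid <- sigma0 * 10^t_grid     # bandwidths from sigma0 / 10 to 10 * sigma0
beta_grid  <- beta0 * 10^(-2 * t_grid)

# The beta grid agrees with converting each bandwidth via beta = 1 / (2 * sigma^2).
beta_from_sigma <- 1 / (2 * sigma_grid^2)
```

Because \(\beta\) depends on \(\sigma^{-2}\), one decade in \(\sigma\) becomes two decades in \(\beta\), and small bandwidths map to large \(\beta\).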

When Uk is NULL, a shorter, coarser grid may be returned; inspect beta_candidates in the returned list (see Value).

Notes:

  • When Uk is identical to U, the function detects this case and excludes self-distances (distance 0) to avoid \(\sigma_0=0\).

  • sample_size performs random subsampling without setting a seed. For reproducible results, call set.seed() before calling this function.
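The effect of seeding on column subsampling can be sketched in base R. This assumes the subsample is drawn with the standard R random number generator (for example via sample.int); the exact call used inside the function is not shown in this page.

```r
set.seed(123)
U <- matrix(rnorm(2 * 500), nrow = 2)   # K = 2, N = 500
sample_size <- 100

# Drawing column indices without replacement after the same seed
# yields the same subsample each time.
set.seed(42)
idx1 <- sample.int(ncol(U), sample_size)
set.seed(42)
idx2 <- sample.int(ncol(U), sample_size)

same_draw <- identical(idx1, idx2)      # TRUE
U_sub <- U[, idx1, drop = FALSE]        # K x sample_size subsample
```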

Examples

# Basic (nearest-neighbor within U)
# beta_info <- nmfkc.kernel.beta.nearest.med(U)
# beta0 <- beta_info$beta
# betas <- beta_info$beta_candidates

# With landmarks (nearest-landmark distances)
# beta_info <- nmfkc.kernel.beta.nearest.med(U, Uk)
# beta0 <- beta_info$beta
# betas <- beta_info$beta_candidates