Estimate Gaussian/RBF kernel parameter beta from covariates (supports landmarks)
Source: R/nmfkc.R
Computes a data-driven reference scale for the Gaussian/RBF kernel from covariates using a robust "median nearest-neighbor (or nearest-landmark) distance" heuristic, and returns the corresponding kernel parameter \(\beta\).
The Gaussian/RBF kernel is assumed to be written in the form $$k(u,v) = \exp\{-\beta \|u-v\|^2\} = \exp\{-\|u-v\|^2/(2\sigma^2)\},$$ hence \(\beta = 1/(2\sigma^2)\). This function first estimates a typical distance scale \(\sigma_0\) by the median of distances, then sets \(\beta_0 = 1/(2\sigma_0^2)\).
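As a quick check of the parameterization above, the following sketch evaluates the kernel in both forms for a single pair of points. The helper `rbf_kernel` and the numeric values are illustrative assumptions, not part of the package.

```r
# Hypothetical helper (not a package function): Gaussian/RBF kernel in the beta form
rbf_kernel <- function(u, v, beta) exp(-beta * sum((u - v)^2))

sigma <- 0.8                    # illustrative bandwidth
beta  <- 1 / (2 * sigma^2)      # corresponding kernel parameter

u <- c(1, 2)
v <- c(0, 0.5)

k_beta  <- rbf_kernel(u, v, beta)                   # exp(-beta * ||u - v||^2)
k_sigma <- exp(-sum((u - v)^2) / (2 * sigma^2))     # exp(-||u - v||^2 / (2 sigma^2))

isTRUE(all.equal(k_beta, k_sigma))                  # the two forms agree
```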
If Uk is NULL, \(\sigma_0\) is estimated as the median of
nearest-neighbor distances within U (excluding self-distance).
If Uk is provided, \(\sigma_0\) is estimated as the median of
nearest-landmark distances from each sample in U to its closest landmark in Uk.
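The two cases above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `median_nearest_dist` is a hypothetical helper, the data are synthetic, and no blocking or subsampling is done.

```r
# Hypothetical helper (not a package function): median nearest-point distance.
# U is K x N and Uk is K x M; columns are points.
median_nearest_dist <- function(U, Uk = NULL) {
  if (is.null(Uk)) {
    D <- as.matrix(dist(t(U)))   # N x N Euclidean distances between columns of U
    diag(D) <- Inf               # exclude self-distances
    nearest <- apply(D, 1, min)
  } else {
    # Squared distances via ||u||^2 + ||v||^2 - 2 u'v, clipped at 0 for round-off
    D2 <- outer(colSums(U^2), colSums(Uk^2), "+") - 2 * crossprod(U, Uk)
    nearest <- sqrt(apply(pmax(D2, 0), 1, min))
  }
  median(nearest)
}

set.seed(1)
U  <- matrix(rnorm(2 * 200), nrow = 2)   # K = 2, N = 200 samples
Uk <- matrix(rnorm(2 * 20),  nrow = 2)   # K = 2, M = 20 landmarks

sigma0_nn <- median_nearest_dist(U)       # nearest-neighbor case (Uk = NULL)
sigma0_lm <- median_nearest_dist(U, Uk)   # nearest-landmark case

beta0_nn <- 1 / (2 * sigma0_nn^2)
beta0_lm <- 1 / (2 * sigma0_lm^2)
```

With fewer landmarks than samples, the nearest-landmark scale is typically larger than the nearest-neighbor scale, so the resulting \(\beta_0\) is typically smaller.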
To control memory usage for large N (and M), distances are computed in blocks.
Optionally, columns of U can be randomly subsampled via sample.size to reduce cost.
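One way to picture the blocked computation is the sketch below, which keeps a running minimum of squared distances so that only one `block.size` x `block.size.Uk` distance matrix is held in memory at a time. The function `nearest_landmark_blocked` is a hypothetical illustration and may differ from what the package does internally.

```r
# Hypothetical sketch (not the package's internal code): nearest-landmark
# distances computed block by block to bound memory use.
nearest_landmark_blocked <- function(U, Uk, block.size = 1000, block.size.Uk = 2000) {
  N <- ncol(U)
  M <- ncol(Uk)
  best2 <- rep(Inf, N)            # running minimum squared distance per sample
  uk2 <- colSums(Uk^2)
  for (i in seq(1, N, by = block.size)) {
    cols <- i:min(i + block.size - 1, N)
    Ub <- U[, cols, drop = FALSE]
    ub2 <- colSums(Ub^2)
    for (j in seq(1, M, by = block.size.Uk)) {
      lms <- j:min(j + block.size.Uk - 1, M)
      # Squared distances between this block of samples and this block of landmarks
      D2 <- outer(ub2, uk2[lms], "+") - 2 * crossprod(Ub, Uk[, lms, drop = FALSE])
      best2[cols] <- pmin(best2[cols], apply(D2, 1, min))
    }
  }
  sqrt(pmax(best2, 0))            # clip tiny negative values caused by round-off
}

set.seed(1)
U  <- matrix(rnorm(3 * 500), nrow = 3)   # K = 3, N = 500
Uk <- matrix(rnorm(3 * 40),  nrow = 3)   # K = 3, M = 40

d_blocked <- nearest_landmark_blocked(U, Uk, block.size = 128, block.size.Uk = 16)
d_full    <- nearest_landmark_blocked(U, Uk, block.size = 500, block.size.Uk = 40)
sigma0    <- median(d_blocked)
```

The block sizes change only peak memory, not the result: `d_blocked` and `d_full` agree up to floating-point round-off.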
Usage
nmfkc.kernel.beta.nearest.med(
U,
Uk = NULL,
block.size = 1000,
block.size.Uk = 2000,
sample.size = NULL,
...
)
Arguments
- U
A numeric matrix of covariates (K x N); columns are samples.
- Uk
An optional numeric matrix of landmarks (K x M); columns are landmark points. If provided, distances are computed from samples in U to landmarks in Uk.
- block.size
Integer. Number of columns of U processed per block when computing distances (controls memory usage). If N <= 1000, it is automatically set to N.
- block.size.Uk
Integer. Number of columns of Uk processed per block when Uk is not NULL (controls memory usage). If M <= 2000, it is automatically set to M.
- sample.size
Integer or NULL. If not NULL, randomly subsamples this many columns of U (without replacement) before computing distances, to reduce computational cost.
- ...
Additional arguments (ignored; reserved for future use).
Value
A list with elements:
- beta: Estimated kernel parameter \(\beta_0 = 1/(2\sigma_0^2)\).
- beta_candidates: Numeric vector of candidate \(\beta\) values (logarithmic grid) intended for cross-validation.
- dist_median: The estimated distance scale \(\sigma_0\) (median of nearest-neighbor or nearest-landmark distances).
- block.size.used: The effective block size(s) used. Either a scalar (no Uk) or a named vector c(U=..., Uk=...) when Uk is provided.
- sample.size.used: The number of columns of U actually used (after subsampling).
- uk_is_u: Logical flag indicating whether Uk was detected as identical to U (only returned when Uk is provided).
Details
Candidate grid:
Along with beta, the function returns beta_candidates, a small logarithmic grid
suitable for cross-validation.
In the landmark case (Uk provided), the grid is symmetric on the log scale of the
bandwidth \(\sigma\), spanning one decade on either side of \(\sigma_0\):
$$\sigma = \sigma_0 \times 10^{t}, \quad t \in \{-1,-2/3,-1/3,0,1/3,2/3,1\}.$$
Using \(\beta = 1/(2\sigma^2)\), this corresponds to
$$\beta = \beta_0 \times 10^{-2t}.$$
When Uk is NULL, a shorter coarse grid may be returned (see Value).
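The landmark-case grid described above can be reproduced by hand as a sanity check. The value of `sigma0` below is an arbitrary illustrative choice, and the variable names are assumptions rather than package objects.

```r
sigma0 <- 0.5                        # illustrative distance scale (assumed value)
beta0  <- 1 / (2 * sigma0^2)         # = 2

t_grid     <- (-3:3) / 3             # t in {-1, -2/3, -1/3, 0, 1/3, 2/3, 1}
sigma_grid <- sigma0 * 10^t_grid     # one decade on either side of sigma0
beta_grid  <- beta0 * 10^(-2 * t_grid)

# Both routes to beta give the same grid
isTRUE(all.equal(beta_grid, 1 / (2 * sigma_grid^2)))
```

Because \(\beta\) depends on \(\sigma^{-2}\), one decade in \(\sigma\) maps to two decades in \(\beta\): here the grid runs from 200 (at \(t=-1\)) down to 0.02 (at \(t=1\)).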
Notes:
- When Uk is identical to U, the function detects this case and excludes self-distances (distance 0) to avoid \(\sigma_0 = 0\).
- sample.size performs random subsampling without setting a seed. For reproducible results, call set.seed() before calling this function.
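The reproducibility note can be illustrated with base R alone. The sketch below assumes the subsampling is equivalent to drawing column indices with `sample()` without replacement; the package's internal mechanism may differ.

```r
set.seed(42)
U <- matrix(rnorm(2 * 1000), nrow = 2)   # K = 2, N = 1000 (synthetic data)
n_sub <- 100                             # plays the role of sample.size

set.seed(123)
idx1 <- sample(ncol(U), n_sub)           # sample() draws without replacement by default
set.seed(123)
idx2 <- sample(ncol(U), n_sub)

identical(idx1, idx2)                    # TRUE: same seed, same columns
U_sub <- U[, idx1, drop = FALSE]         # K x n_sub matrix of subsampled columns
```

Without the `set.seed()` calls, two runs would generally select different columns and therefore give slightly different estimates of \(\sigma_0\).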
Examples
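library(nmfkc)  # assumes the package is installed
# Synthetic covariates for illustration: K = 2 features, N = 100 samples (columns)
set.seed(1)
U <- matrix(rnorm(2 * 100), nrow = 2)
# Basic (nearest-neighbor within U)
beta_info <- nmfkc.kernel.beta.nearest.med(U)
beta0 <- beta_info$beta
betas <- beta_info$beta_candidates
# With landmarks (nearest-landmark distances); Uk holds M = 10 synthetic landmarks
Uk <- matrix(rnorm(2 * 10), nrow = 2)
beta_info <- nmfkc.kernel.beta.nearest.med(U, Uk)
beta0 <- beta_info$beta
betas <- beta_info$beta_candidates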