Results: SNR-based Template Pre-selection
=========================================

Overview
--------

When tens or hundreds of imaging-systematic maps are available, running the
full Bayesian decontamination pipeline on all of them simultaneously is
computationally expensive and numerically ill-conditioned.  The SNR pre-selection
pipeline addresses this by inserting a fast Stage 1 before the full Stage 2
decontamination:

**Stage 1 — SNR ranking.**
All candidate templates are scored by a fast per-template statistic that
measures their association with the observed galaxy overdensity.  The
top-:math:`K` templates are retained.

**Stage 2 — Full decontamination.**
The reduced template set is passed to the standard greedy forward selection or
Bayesian MCMC pipeline.

This document validates Stage 1 against a controlled simulation where the
injected templates are known a priori.

Three ranking statistics are implemented (see :mod:`sys_mapping.diagnostics`):

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Method key
     - Description
   * - ``"data"``
     - Pearson cross-correlation :math:`|r|` between the observed overdensity
       :math:`\delta_g` and each template :math:`\delta_t`.
   * - ``"template"``
     - OLS regression of :math:`\delta_g` on each template individually;
       SNR = :math:`|\hat\alpha| / \sigma_{\hat\alpha}` (absolute *t*-statistic).
   * - ``"isd"``
     - ISD :math:`\Delta\chi^2` contamination metric
       (Rodríguez-Monroy et al. 2025, arXiv:2509.07943, Sec. IV.A.1).
       Pixels are binned by template value; a polynomial is fit to the
       binned galaxy density; :math:`\Delta\chi^2 = \chi^2_\mathrm{null} -
       \chi^2_\mathrm{model}` measures how much the polynomial reduces the
       scatter.

A fourth optional statistic, ``"peak"``, ranks templates by the peak amplitude
of the cross-power spectrum :math:`C_\ell^{gT}`.
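
To make the first two rows of the table concrete, the following is a minimal
NumPy sketch of the two linear statistics.  It is not the
:mod:`sys_mapping.diagnostics` implementation: the function names, the
unweighted fit with an intercept, and the Gaussian toy maps are all assumptions
made for illustration.

.. code-block:: python

   import numpy as np

   def pearson_abs_r(delta_g, templates):
       """|r| between delta_g (n_pix,) and each row of templates (n_t, n_pix)."""
       g = delta_g - delta_g.mean()
       t = templates - templates.mean(axis=1, keepdims=True)
       return np.abs(t @ g) / np.sqrt((t**2).sum(axis=1) * (g**2).sum())

   def ols_abs_t(delta_g, templates):
       """|alpha_hat| / sigma_alpha_hat for one-template fits with an intercept."""
       n = delta_g.size
       g = delta_g - delta_g.mean()
       t = templates - templates.mean(axis=1, keepdims=True)
       stt = (t**2).sum(axis=1)
       alpha = (t @ g) / stt                      # per-template slope
       resid = g[None, :] - alpha[:, None] * t    # per-template residuals
       s2 = (resid**2).sum(axis=1) / (n - 2)      # residual variance
       return np.abs(alpha) / np.sqrt(s2 / stt)

   # Toy data (hypothetical): 20 Gaussian templates, two of them injected.
   rng = np.random.default_rng(0)
   n_pix = 12 * 32**2
   templates = rng.standard_normal((20, n_pix))
   delta_g = (0.1 * rng.standard_normal(n_pix)
              + 0.05 * templates[2] + 0.05 * templates[7])

   r = pearson_abs_r(delta_g, templates)
   t_stat = ols_abs_t(delta_g, templates)
   top2 = sorted(np.argsort(r)[::-1][:2].tolist())

With these toy inputs the two injected indices come out on top for both
statistics; the numerical scores will differ from those in the figures, which
use the simulated galaxy field.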

For a rigorous significance test, :func:`sys_mapping.diagnostics.isd_template_significance`
compares :math:`\Delta\chi^2_\mathrm{data}` against a distribution of
:math:`\Delta\chi^2_\mathrm{mock}` values computed from GLASS systematic-free
mocks on the same footprint, producing mock-based *p*-values.
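
The mock comparison can be sketched as a rank-based *p*-value.  The estimator
below, :math:`(1 + n_\mathrm{exceed}) / (N_\mathrm{mocks} + 1)`, is an
assumption chosen because it reproduces the :math:`1/(N_\mathrm{mocks}+1)`
resolution quoted later in this document; the function name and array layout
are hypothetical and may differ from
:func:`sys_mapping.diagnostics.isd_template_significance`.

.. code-block:: python

   import numpy as np

   def mock_p_values(dchi2_data, dchi2_mocks):
       """Rank-based p-values.

       dchi2_data  : (n_templates,) statistic measured on the data
       dchi2_mocks : (n_mocks, n_templates) statistic measured on each mock
       """
       dchi2_data = np.asarray(dchi2_data)
       dchi2_mocks = np.asarray(dchi2_mocks)
       n_mocks = dchi2_mocks.shape[0]
       # Number of mocks whose statistic is at least as extreme as the data.
       n_exceed = (dchi2_mocks >= dchi2_data[None, :]).sum(axis=0)
       return (1 + n_exceed) / (n_mocks + 1)

   # Toy null distribution (hypothetical): chi-square draws with 1 dof.
   rng = np.random.default_rng(1)
   mocks = rng.chisquare(df=1, size=(100, 3))
   data = np.array([50.0, 0.5, 1.0])   # first template strongly "detected"
   p = mock_p_values(data, mocks)

A template whose data statistic exceeds every mock receives the smallest
attainable value, here 1/101.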


Simulation setup
----------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Parameter
     - Value
   * - HEALPix resolution
     - NSIDE = 32 (:math:`n_\mathrm{pix}` = 12 288)
   * - Galaxy count
     - 15 000 000 (full sky)
   * - Redshift shell
     - :math:`0 \le z \le 0.5` (single tophat bin)
   * - Systematic templates
     - 20 synthetic maps; families 0–4 generated 4× with independent seeds
       (see :func:`sys_mapping.maps.generate_systematic_maps`)
   * - Injected templates
     - Templates **2** and **7** (★ in figures) — two realisations of family 2
   * - Contamination model
     - Additive: :math:`\delta_g^\mathrm{obs} = \delta_g^\mathrm{clean}
       + a \,\delta_{t_2} + a \,\delta_{t_7}`
   * - Contamination levels
     - low :math:`(a=0.02)`, medium :math:`(a=0.05)`, high :math:`(a=0.10)`
   * - GLASS mocks for ISD
     - 100 systematic-free full-sky mocks on the same footprint

The figure-generating script is ``scripts/run_snr_preselection_demo.py``.
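
The additive contamination model in the table can be written in a few lines.
This is a sketch only: ``inject_additive`` is a hypothetical helper, and the
Gaussian arrays stand in for the simulated overdensity and the output of
:func:`sys_mapping.maps.generate_systematic_maps`.

.. code-block:: python

   import numpy as np

   def inject_additive(delta_clean, templates, injected, amplitude):
       """Return delta_clean + amplitude * (sum of the selected templates)."""
       return delta_clean + amplitude * templates[list(injected)].sum(axis=0)

   nside = 32
   n_pix = 12 * nside**2                             # 12 288 pixels
   rng = np.random.default_rng(2)
   delta_clean = 0.1 * rng.standard_normal(n_pix)    # toy stand-in
   templates = rng.standard_normal((20, n_pix))      # toy stand-in

   levels = {"low": 0.02, "medium": 0.05, "high": 0.10}
   contaminated = {
       name: inject_additive(delta_clean, templates, (2, 7), a)
       for name, a in levels.items()
   }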


HEALPix maps
------------

The panel below shows the simulated galaxy overdensity, representative template
maps, and the contaminated field at medium level.

.. figure:: /_static/results_snr_preselection/01_maps.png
   :width: 100 %
   :alt: HEALPix maps of the simulated universe and systematic templates

   **Left to right, top row:** Clean galaxy overdensity :math:`\delta_g`,
   two injected templates (T2, T7), a representative noise template.
   **Bottom row:** Contaminated overdensity at medium level and the
   residual contamination signal.


SNR ranking
-----------

The bar charts below show the SNR score assigned to each template by each
ranking method at the three contamination levels.  The two injected
templates (★) are coloured in coral red; noise templates in grey.

Data cross-correlation method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: /_static/results_snr_preselection/02_snr_ranking_data.png
   :width: 100 %
   :alt: SNR bar chart for the data cross-correlation method

   **Data cross-correlation** :math:`|r|`.  Both injected templates rank
   first and second at all three contamination levels.  Noise templates
   have scores consistent with zero within statistical fluctuations.

OLS *t*-statistic method
^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: /_static/results_snr_preselection/02_snr_ranking_template.png
   :width: 100 %
   :alt: SNR bar chart for the OLS t-statistic method

   **OLS** :math:`|\hat\alpha|/\sigma_{\hat\alpha}`.  The signal-to-noise
   ratio is generally sharper than the Pearson correlation because the
   template standard deviation is factored into the denominator.
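
For reference, the two linear statistics are closely related.  As a sketch,
assume each per-template fit is a simple unweighted linear regression with an
intercept and homoscedastic residuals (an assumption about the implementation,
not something stated above).  The *t*-statistic is then a monotonic function of
the Pearson coefficient :math:`r`:

.. math::

   \frac{\hat\alpha}{\sigma_{\hat\alpha}}
   = \frac{r\,\sqrt{n_\mathrm{pix} - 2}}{\sqrt{1 - r^2}}

Under that assumption the ``"data"`` and ``"template"`` methods order the
templates identically, and for small :math:`|r|` the *t*-statistic is
approximately :math:`|r|\sqrt{n_\mathrm{pix}}`, i.e. roughly
:math:`111\,|r|` at NSIDE = 32.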

ISD :math:`\Delta\chi^2` method
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: /_static/results_snr_preselection/02_snr_ranking_isd.png
   :width: 100 %
   :alt: SNR bar chart for the ISD Delta-chi2 method

   **ISD** :math:`\Delta\chi^2` (Rodríguez-Monroy et al. 2025).
   The statistic grows quadratically with amplitude, making it highly
   sensitive at medium and high contamination but potentially
   less sensitive than the correlation at low levels.
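
A minimal sketch of a binned :math:`\Delta\chi^2` statistic of this kind is
given below.  The choices made here (equal-count bins, errors taken as the
standard error of the mean in each bin, a null model equal to the global mean
overdensity, and the name ``isd_delta_chi2``) are assumptions for illustration
and may differ from both the reference paper and the package implementation.

.. code-block:: python

   import numpy as np

   def isd_delta_chi2(delta_g, template, n_bins=10, poly_order=1):
       """chi2_null - chi2_model for the binned delta_g vs template relation."""
       # Equal-count bins in template value.
       bins = np.array_split(np.argsort(template), n_bins)
       x = np.array([template[b].mean() for b in bins])
       y = np.array([delta_g[b].mean() for b in bins])
       err = np.array([delta_g[b].std(ddof=1) / np.sqrt(b.size) for b in bins])

       # Null model: no trend, every bin sits at the global mean.
       chi2_null = (((y - delta_g.mean()) / err) ** 2).sum()

       # Weighted polynomial fit; np.polyfit expects weights of 1/sigma.
       coeffs = np.polyfit(x, y, deg=poly_order, w=1.0 / err)
       chi2_model = (((y - np.polyval(coeffs, x)) / err) ** 2).sum()
       return chi2_null - chi2_model

   # Toy data (hypothetical): one injected and one unrelated template.
   rng = np.random.default_rng(3)
   n_pix = 12 * 32**2
   t_inj = rng.standard_normal(n_pix)
   t_noise = rng.standard_normal(n_pix)
   delta_g = 0.1 * rng.standard_normal(n_pix) + 0.05 * t_inj

   dchi2_inj = isd_delta_chi2(delta_g, t_inj)
   dchi2_noise = isd_delta_chi2(delta_g, t_noise)

Because any constant is a special case of the fitted polynomial, the
difference is non-negative up to rounding, and it is much larger for the
injected template than for the unrelated one.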


Detection sensitivity vs amplitude
------------------------------------

.. figure:: /_static/results_snr_preselection/03_detection_vs_amplitude.png
   :width: 80 %
   :alt: SNR of injected template 2 as a function of contamination amplitude

   SNR assigned to injected template 2 as a function of contamination
   amplitude :math:`a` for the three methods.  Vertical dotted lines mark
   the three canonical levels.  All methods increase monotonically with
   amplitude.  The data and template methods are approximately linear in
   :math:`a`; the ISD :math:`\Delta\chi^2` grows roughly as :math:`a^2`.
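
The scaling quoted in the caption can be motivated with a short derivation.
As a sketch, consider a single injected template that is uncorrelated with the
rest of the observed field, and let :math:`\sigma_t` be the pixel standard
deviation of :math:`\delta_t` and :math:`\sigma_g` that of everything else in
the observed overdensity (symbols introduced here for illustration):

.. math::

   r(a) = \frac{a\,\sigma_t}{\sqrt{\sigma_g^2 + a^2\sigma_t^2}}
        \;\approx\; a\,\frac{\sigma_t}{\sigma_g}
   \qquad (a\,\sigma_t \ll \sigma_g)

For a linear fit to the binned relation, the expected improvement over the
null model is approximately

.. math::

   \Delta\chi^2 \;\approx\; a^2 \sum_b \frac{\bar{x}_b^{\,2}}{\sigma_b^2}

up to a noise contribution of order unity, where :math:`\bar{x}_b` is the
mean-subtracted template value in bin :math:`b` and :math:`\sigma_b` the error
on the binned overdensity.  The correlation-based scores are therefore linear
in :math:`a` at small amplitude, while :math:`\Delta\chi^2` is quadratic.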


ISD mock-based *p*-values
--------------------------

.. figure:: /_static/results_snr_preselection/04_pvalues.png
   :width: 100 %
   :alt: Mock-based p-values from ISD significance test

   Mock-based *p*-values from :func:`sys_mapping.diagnostics.isd_template_significance`
   with :math:`N_\mathrm{mocks} = 100` GLASS systematic-free mocks.
   The red dashed line marks the conventional :math:`p = 0.05` threshold.
   At all three levels, both injected templates (T2, T7) are significantly
   detected, saturating at the minimum attainable value
   :math:`p = 1/(N_\mathrm{mocks}+1)`, while all 18 noise templates
   scatter around :math:`p \sim 0.5`.


Full pipeline: pre-selection then decontamination
--------------------------------------------------

.. figure:: /_static/results_snr_preselection/05_pipeline_result.png
   :width: 60 %
   :alt: Detection summary table

   Detection summary: green cell = both injected templates T2 and T7 appear
   in the top-3 ranking (out of 20); grey cell = at least one is missed.

All three methods recover both injected templates in the top 3 at every
contamination level.  In practice, retaining the top :math:`K` templates, with
:math:`K` chosen conservatively (e.g. :math:`K = N/2` for an initial pool of
:math:`N` templates), and then running greedy forward selection on the
reduced set is a robust and computationally efficient strategy.
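
The top-:math:`K` cut itself is a one-line operation on the score vector.  The
helper below is a hypothetical sketch, not the interface of
:func:`sys_mapping.model_selection.snr_preselect`, and the scores are made-up
numbers.

.. code-block:: python

   import numpy as np

   def top_k_indices(scores, k):
       """Indices of the k highest scores, best first."""
       scores = np.asarray(scores)
       k = min(k, scores.size)
       return np.argsort(scores)[::-1][:k]

   # Made-up scores for 8 templates; indices 2 and 7 stand out.
   scores = np.array([0.3, 0.1, 4.2, 0.2, 0.5, 0.0, 0.4, 3.9])
   keep = top_k_indices(scores, k=scores.size // 2)   # conservative K = N/2

The retained indices would then be used to slice the template array before
Stage 2.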


Computational cost
------------------

The cost of the ranking step is independent of the galaxy count because it
operates on pixelised maps (here 12 288 pixels for each of 20 templates).

.. figure:: /_static/results_snr_preselection/06_timing.png
   :width: 65 %
   :alt: Wallclock time per ranking method

   Best-of-5 wallclock time for a single call to
   :func:`~sys_mapping.diagnostics.snr_template_ranking` with 20 templates on
   NSIDE = 32.
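
A best-of-5 measurement of this kind can be sketched as follows.  The helper
name and the toy workload (a single matrix-vector product) are assumptions; the
actual timing code lives in the demo script.

.. code-block:: python

   import time
   import numpy as np

   def best_of(fn, repeats=5):
       """Return the minimum wallclock time in seconds over `repeats` calls."""
       best = float("inf")
       for _ in range(repeats):
           start = time.perf_counter()
           fn()
           best = min(best, time.perf_counter() - start)
       return best

   rng = np.random.default_rng(5)
   n_pix = 12 * 32**2
   templates = rng.standard_normal((20, n_pix))
   delta_g = rng.standard_normal(n_pix)

   # Toy workload standing in for one ranking call.
   elapsed = best_of(lambda: templates @ delta_g)
   print(f"best of 5: {elapsed * 1e3:.3f} ms")

When the timed function dispatches JAX work, the first call also includes
compilation, and the result should be forced (for example with
``block_until_ready()``) before the clock is stopped; taking the minimum over
several calls excludes the compilation cost.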

.. list-table::
   :header-rows: 1
   :widths: 25 20 55

   * - Method
     - Typical time
     - Notes
   * - ``"data"``
     - < 1 ms
     - Pearson correlation; JAX batched matrix multiply — :math:`\mathcal{O}(n_\mathrm{pix} \times n_\mathrm{templates})`
   * - ``"template"``
     - < 1 ms
     - Per-template OLS *t*-statistic; JAX vmap over templates
   * - ``"isd"``
     - 1–10 ms
     - JAX vmap over templates, fixed-size bins, analytic linear regression;
       falls back to NumPy for ``poly_order > 1`` or ``fracdet`` weighting

The dominant runtime cost in practice is the GLASS mock generation for
:func:`~sys_mapping.diagnostics.isd_template_significance`: each of the
:math:`N_\mathrm{mocks}` systematic-free mocks requires drawing galaxy
positions from the pixelised density field (:math:`\mathcal{O}(N_\mathrm{total})`).
At 15 M galaxies and ``rand_factor=2``, each mock takes roughly 1–5 s on a
modern workstation, so the total ISD significance cost is
:math:`\sim N_\mathrm{mocks} \times 3\,\mathrm{s}`, or about 5 minutes for
:math:`N_\mathrm{mocks} = 100`.


Choosing the number of mocks for ISD significance
--------------------------------------------------

The mock-based *p*-value estimate has a finite resolution of
:math:`1/(N_\mathrm{mocks}+1)` and a binomial standard error of
:math:`\sqrt{p(1-p)/N_\mathrm{mocks}}`, i.e. about
:math:`0.5/\sqrt{N_\mathrm{mocks}}` for noise-template *p*-values near 0.5.
The figure below quantifies how many mocks are needed for the noise-template
*p*-values to converge.

.. figure:: /_static/results_snr_preselection/07_mock_convergence.png
   :width: 100 %
   :alt: Mock convergence of ISD p-values

   **Left:** p-values for all 20 templates as a function of :math:`N_\mathrm{mocks}`
   (reusing the 100-mock run at the high contamination level by subsampling the
   mock :math:`\Delta\chi^2` matrix).
   Coral lines = injected templates (T2, T7); grey lines = 18 noise templates;
   red dashed = :math:`p=0.05` threshold; dotted black = theoretical lower
   bound :math:`1/(N+1)`.
   **Right:** convergence metric — maximum absolute change of any noise-template
   p-value relative to the :math:`N=100` reference, as a function of
   :math:`N_\mathrm{mocks}`.  The vertical green dashed line marks the first
   :math:`N` where this change falls below 0.05.

The injected templates saturate at the minimum attainable p-value
:math:`1/(N+1)` from the very first few mocks, confirming that even
:math:`N_\mathrm{mocks}=5` is sufficient to *detect* a strongly contaminating
template.  The noise-template p-values, however, require more mocks to stabilise:
the convergence metric falls below 0.05 only around :math:`N \approx 50\text{–}100`.
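
The subsampling behind the figure can be sketched as follows.  This is an
illustration under assumptions: the mock matrix is replaced by chi-square toy
draws, the *p*-value estimator is the rank-based
:math:`(1 + n_\mathrm{exceed})/(N + 1)` form, and the first :math:`N` rows are
used as the subsample.

.. code-block:: python

   import numpy as np

   def p_values_vs_n(dchi2_data, dchi2_mocks, n_values):
       """p-values from the first n mocks, for each n in n_values."""
       rows = []
       for n in n_values:
           sub = dchi2_mocks[:n]
           n_exceed = (sub >= dchi2_data[None, :]).sum(axis=0)
           rows.append((1 + n_exceed) / (n + 1))
       return np.array(rows)            # shape (len(n_values), n_templates)

   # Toy stand-ins (hypothetical) for the 100 x 20 mock matrix and the data.
   rng = np.random.default_rng(4)
   mocks = rng.chisquare(df=1, size=(100, 20))
   data = rng.chisquare(df=1, size=20)
   data[[2, 7]] = 1e3                   # injected templates exceed every mock

   n_values = np.arange(5, 101, 5)
   p = p_values_vs_n(data, mocks, n_values)

   # Convergence metric: largest change of any noise-template p-value
   # relative to the N = 100 reference.
   noise = np.setdiff1d(np.arange(20), [2, 7])
   max_change = np.abs(p[:, noise] - p[-1, noise]).max(axis=1)

In this toy setting the injected columns sit at :math:`1/(N+1)` for every
:math:`N`, while ``max_change`` shrinks to zero at the reference value
:math:`N = 100`.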

**Rule of thumb:**

.. math::

   N_\mathrm{mocks} \;\ge\; \max\!\left(20,\; \left\lceil \frac{5}{\alpha} \right\rceil\right)

where :math:`\alpha` is the target significance level.

.. list-table::
   :header-rows: 1
   :widths: 30 30 40

   * - Target :math:`\alpha`
     - Minimum :math:`N_\mathrm{mocks}`
     - Notes
   * - 0.10
     - 50
     - Fast pre-screening
   * - 0.05
     - 100
     - Standard significance level
   * - 0.01
     - 500
     - High-confidence selection
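
The rule of thumb and the resulting mock budget can be evaluated directly.
The helper names are hypothetical, and the 3 s per mock is the rough midpoint
quoted in the cost section, not a measured constant.

.. code-block:: python

   import math

   def min_mocks(alpha, floor=20):
       """Rule of thumb: N_mocks >= max(floor, ceil(5 / alpha))."""
       # round() guards against floating-point noise before taking the ceiling.
       return max(floor, math.ceil(round(5.0 / alpha, 9)))

   def mock_cost_seconds(n_mocks, seconds_per_mock=3.0):
       """Approximate wallclock cost of generating n_mocks mocks."""
       return n_mocks * seconds_per_mock

   table = {alpha: min_mocks(alpha) for alpha in (0.10, 0.05, 0.01)}
   cost_at_005 = mock_cost_seconds(table[0.05])   # seconds

The three entries match the table above, and the standard 5 % level
corresponds to roughly five minutes of mock generation under the stated
per-mock cost.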


Conclusion
----------

* **All three SNR ranking methods** (data cross-correlation, OLS *t*-statistic,
  ISD :math:`\Delta\chi^2`) rank both injected templates in the top 3 of 20
  candidates at all tested contamination amplitudes
  (:math:`0.02 \le a \le 0.10`) with 15 M galaxies.

* **The data and template methods** are approximately linear in amplitude,
  reliable from the lowest tested level, and execute in < 1 ms.  They are the
  fastest route to a short-list.

* **The ISD** :math:`\Delta\chi^2` **method** follows Rodríguez-Monroy et al.
  (2025).  Its mock-based p-value provides a rigorous significance test without
  assumptions about the null distribution.  The mock generation is the dominant
  cost; :math:`N_\mathrm{mocks} = 100` is sufficient for a 5 % significance
  threshold.

* **Recommended workflow:**

  1. Run ``method="data"`` on all candidate templates to build a short-list
     (< 1 ms, so effectively free).
  2. Apply :func:`~sys_mapping.diagnostics.isd_template_significance` on the
     short-list with :math:`N_\mathrm{mocks} \ge 100` to obtain rigorous
     p-values (minutes).
  3. Pass templates with :math:`p < 0.05` to
     :func:`~sys_mapping.model_selection.greedy_forward_select` or the Bayesian
     MCMC pipeline.

API reference
-------------

.. autofunction:: sys_mapping.diagnostics.snr_template_ranking
   :no-index:
.. autofunction:: sys_mapping.diagnostics.isd_template_significance
   :no-index:
.. autofunction:: sys_mapping.model_selection.snr_preselect
   :no-index:
.. autoclass:: sys_mapping.model_selection.SnrPreselectionResult
   :members:
   :no-index: