Testing
=======

``sys_mapping`` has a comprehensive test suite in the ``tests/`` directory.
Tests are organised by module and cover correctness, numerical accuracy,
and performance.

Running the tests
-----------------

Install the package with the development extras first::

   pip install -e ".[dev]"

Then run the full suite::

   pytest tests/ -v

Run a single module::

   pytest tests/test_contamination.py -v

Run tests that require scikit-learn (skipped otherwise)::

   pip install scikit-learn
   pytest tests/test_regression.py -v

Run performance benchmarks (slow; uses ``--benchmark`` marker)::

   pytest tests/test_benchmarks.py -v

----

Test modules
------------

.. list-table::
   :header-rows: 1
   :widths: 35 10 55

   * - File
     - Tests
     - Coverage
   * - ``test_contamination.py``
     - 15
     - :mod:`~sys_mapping.contamination` — parameter layout, pack/unpack
       round-trip, forward model, two-point correction formula
   * - ``test_likelihood.py``
     - 7
     - :mod:`~sys_mapping.likelihood` — Gaussian and skew-normal log-likelihoods,
       JAX gradient correctness via finite differences
   * - ``test_maps.py``
     - 32
     - :mod:`~sys_mapping.maps` — power spectrum generation, synthetic map
       generation, catalog pixelisation, overdensity computation; plus
       ``load_real_template`` / ``load_real_templates`` (5 synthetic FITS
       fixture tests + 7 real-data tests guarded by ``@real_data``)
   * - ``test_correction.py``
     - 12
     - :mod:`~sys_mapping.correction` — noise debiasing, PCA rotation,
       round-trip param transform, two-point correction
   * - ``test_model_selection.py``
     - 7
     - :mod:`~sys_mapping.model_selection` — LRT statistic, chi-squared
       distribution under the null, p-value monotonicity
   * - ``test_accuracy.py``
     - 34
     - End-to-end accuracy: contamination round-trip errors < 1e-10;
       overdensity mean near zero; likelihood gradient norm; correction
       bias below tolerance; amplitude bias estimator
   * - ``test_power_spectrum.py``
     - 19
     - :mod:`~sys_mapping.power_spectrum` — harmonic bias formula
       :math:`b_\ell = -n/(2\ell+1)`, pseudo-Cℓ white noise validation,
       template subtraction identity (alpha=0), BMP/EMP mode projection
   * - ``test_regression.py``
     - 26
     - :mod:`~sys_mapping.regression` — weight bounds, ElasticNet recovery
       (skipped without scikit-learn), ISD convergence and polynomial
       order reduction, method comparison output keys;
       ``combined_mcmc`` keys and default-method assertion.
   * - ``test_diagnostics.py``
     - 19
     - :mod:`~sys_mapping.diagnostics` — null test on uncorrelated fields,
       SNR ranking shape and ordering, footprint masking output consistency
   * - ``test_mocks.py``
     - 40
     - :mod:`~sys_mapping.mocks` — lognormal field properties (skewness,
       positivity, reproducibility), galactic mask geometry, mock catalog
       shapes and physics, suite scenarios, pipeline integration
   * - ``test_benchmarks.py``
     - 19
     - Execution time regression: contamination, maps, likelihood,
       correction, and utils must complete within set thresholds
   * - ``test_timing.py``
     - 240
     - Wall-clock scaling of all six methods as a function of NSIDE
       (8, 16, 32, 64) and n_templates (1–10); 40 configs × 4 NSIDE
       values = 160 tests per fast method; MCMC tests marked
       ``@pytest.mark.slow`` (skip with ``-m "not slow"``)
   * - ``test_real_templates.py``
     - 28
     - Integration tests using real GAIA and LS DR10 HEALPix maps as
       systematic templates (synth_5, synth_6); all six methods exercised
       on a synthetic mock built within the LS10 survey footprint (NSIDE = 32,
       ~5 954 valid pixels).  Tests skipped automatically if FITS files are
       absent.  See :ref:`real-template-tests` below.

----

Test design philosophy
-----------------------

**Unit tests** check individual functions in isolation:

* Input/output shapes are always verified.
* Round-trip identities are tested (e.g. ``apply_contamination`` ∘
  ``invert_contamination`` = identity).
* Known closed-form results are tested to machine precision where applicable
  (e.g. the harmonic bias formula, pack/unpack).

**Accuracy tests** (``test_accuracy.py``) verify that numerical errors
remain below physically meaningful thresholds across a range of NSIDE
values, template counts, and noise levels.

**Integration tests** (``TestMockPipelineIntegration`` in ``test_mocks.py``)
run the full pipeline from raw catalogs to overdensity estimates and confirm
that OLS recovers injected additive amplitudes to within 50 % at low
signal-to-noise (NSIDE = 16, 100 galaxies / pixel).

**Skipped tests** — ``test_real_templates.py`` tests are skipped when the
GAIA and LS10 FITS files are absent from
``~/data/legacysurvey/dr10/systematics/``.  ElasticNet tests in
``test_regression.py`` are skipped when ``scikit-learn`` is not installed.

----

Continuous integration
-----------------------

The test suite is designed to run in the ``sys_map`` conda environment::

   conda activate sys_map
   pytest tests/ -v --tb=short

Expected output with scikit-learn and real data files present::

   258 passed in ~34 s

Without real data files (no GAIA/LS10 FITS)::

   230 passed in ~21 s

----

Systematic test matrix
-----------------------

Beyond the unit and integration test suite, a separate **systematic test
matrix** is available in ``scripts/run_systematic_tests.py``.  This script
is not run as part of ``pytest``; it is a standalone benchmark that exhaustively
evaluates all six methods across 32 contamination configurations (Tier 1:
additive-only and multiplicative-only with 1–7 templates; Tier 2: mixed
additive+multiplicative with varying numbers of multiplicative templates).

The key metric is :math:`\sigma[(1+\delta_g^{\rm corr})/(1+\delta_g^{\rm true})]`,
the standard deviation of the pixel-level correction ratio.  Results and figures
are documented in :doc:`results_systematic_tests`.

To run::

   conda activate sys_map
   python scripts/run_systematic_tests.py \\
       --nside 32 --n-walkers 64 --n-steps 200 --n-burn 50 \\
       --output-dir results/systematic_tests/

Expected runtime: ~8 minutes (NSIDE = 32, all 6 methods, 32 configurations).

After re-running with the updated script, the CSV will contain a ``time_s`` column
per method-configuration row, and two timing figures are written to
``results/systematic_tests/``:

* ``timing_vs_ntemplates.png`` — wall-clock time vs. n_templates per method (log scale)
* ``timing_mean_per_method.png`` — mean compute time per method across all 32 configs

----

Timing scaling tests
---------------------

``tests/test_timing.py`` parametrises all six methods over NSIDE ∈ {8, 16, 32, 64}
and n_templates ∈ {1, …, 10}, measuring actual wall-clock time for each
(NSIDE, n_templates) pair.  MCMC tests are marked ``@pytest.mark.slow`` and
can be skipped::

   pytest tests/test_timing.py -v -s -m "not slow"   # OLS / ElasticNet / ISD only
   pytest tests/test_timing.py -v -s                  # all methods (slow)

The ``-s`` flag is required to see the per-test timing printed to stdout.
Each test asserts that the method completes within a generous wall-clock bound
(30 s for OLS/ISD, 120 s for ElasticNet/ISD-3, 600 s for MCMC), so the suite
fails only if a method becomes catastrophically slow.  Use the stdout output for
the actual scaling data.

Expected test count: 40 (NSIDE × n_templates) per method class × 6 classes = 240 tests
(excluding MCMC slow tests from the default run: 160 fast tests).

----

Adding new tests
-----------------

Follow the existing pattern:

1. Create a test class per public function group (e.g. ``class TestMyFunction``).
2. Use ``@pytest.fixture(scope="class")`` for expensive shared objects.
3. Test shapes, dtype, value ranges, and known analytic limits.
4. For stochastic tests, fix the seed via ``seed=`` or ``np.random.default_rng(N)``.
5. Add the new file to this page.

----

.. _real-template-tests:

Real-template integration tests
---------------------------------

``tests/test_real_templates.py`` tests every implemented method end-to-end on
a synthetic galaxy mock built using **real observational systematic maps**
from GAIA DR3 and the Legacy Survey DR10.  These tests exercise the entire
pipeline — loading, normalisation, injection, inference, and model selection
— with physically realistic templates rather than toy random fields.

Template set
^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 12 20 18 50

   * - Label
     - Source
     - NSIDE (tests)
     - Physical interpretation
   * - ``synth_0``
     - Synthetic, family 0 (:math:`C_\ell \propto e^{-\ell/500}`)
     - 32
     - Large-scale coherent artefact (e.g. zodiacal light)
   * - ``synth_1``
     - Synthetic, family 1 (:math:`C_\ell \propto e^{-(\ell/250)^2}`)
     - 32
     - Intermediate-scale artefact (e.g. airglow)
   * - **synth_5**
     - **GAIA DR3** — ``GAIA_nstar_faint_NSIDE_00032.fits`` (``nstar_faint`` column)
     - **32**
     - **Faint-star surface density** — proxy for stellar contamination
       (misclassified stars inflate galaxy counts)
   * - **synth_6**
     - **LS DR10** — ``LS10_GALDEPTH_Z_NSIDE_0032.fits`` (``GALDEPTH_Z`` column)
     - **32**
     - **Galaxy depth in the z band** — proxy for selection-depth variations
       (depth modulates survey completeness multiplicatively)

Survey footprint: the LS10 depth valid mask (pixels where GALDEPTH_Z > 0)
restricts the mock to **5 954 pixels** (48.4 % of the NSIDE = 32 sphere).
GAIA is a full-sky map (12 288 / 12 288 valid pixels); its values inside the
LS10 footprint are used.

Mock configuration and injected parameters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :widths: 40 60
   :header-rows: 1

   * - Parameter
     - Value
   * - NSIDE
     - 32 (pixel area ≈ 3.4 deg², 12 288 pixels total)
   * - Survey footprint pixels
     - 5 954 (LS10 depth mask)
   * - Templates :math:`n_s`
     - 4 (synth_0, synth_1, GAIA_nstar_faint, LS10_GALDEPTH_Z)
   * - :math:`a_i^{\rm true}` (additive)
     - :math:`(0.08,\ {-0.05},\ 0.06,\ {-0.04})`
   * - :math:`b_i^{\rm true}` (multiplicative)
     - :math:`(0.04,\ 0.00,\ {-0.03},\ 0.05)`
   * - Mean galaxies per pixel :math:`\bar n`
     - 50
   * - Random / galaxy ratio
     - 8×
   * - Seed
     - 7

The galaxy overdensity follows a lognormal field
:math:`\delta_g^{\rm true} = e^{G - \sigma_G^2/2} - 1` with
:math:`C_\ell^G \propto (\ell+1)^{-2}`, :math:`\sigma_G = 0.5`.
Note that template 1 (``synth_1``) has :math:`b_1^{\rm true} = 0` (purely
additive) while templates 0, 2, 3 have non-zero multiplicative amplitudes —
a realistic mixed-contamination scenario.

Method results
^^^^^^^^^^^^^^

Each of the six implemented methods is run once on this fixed mock:

.. list-table::
   :header-rows: 1
   :widths: 22 20 16 42

   * - Method
     - Mean :math:`|\hat a_i - a_i^{\rm true}|`
     - Tolerance
     - Notes
   * - **OLS**
     - < 0.20
     - 0.20
     - Ordinary least-squares pixel regression; fastest method
   * - **ElasticNet**
     - < 0.25
     - 0.25
     - Cross-validated (3 folds); requires ``scikit-learn ≥ 1.3``
   * - **ISD-1** (poly_order = 1)
     - < 0.25
     - 0.25
     - Converges in < 50 iterations
   * - **ISD-3** (poly_order = 3)
     - n/a (numerically unstable)
     - finite values only
     - Polynomial expansion produces 34 features for :math:`n_s = 4`,
       :math:`n_{\rm pix}/n_{\rm feat} \approx 175`; ill-conditioned
       with real correlated templates.  Check finiteness only.
   * - **MCMC-additive**
     - < 0.25
     - 0.25
     - Chain shape :math:`(n_w \times 160,\; n_s + 1)` with
       :math:`n_w \geq 2(n_s+1)+2 = 12`; :math:`n_{\rm dim} = 5`
   * - **MCMC-combined** (Berlfein+2024)
     - < 0.30 for :math:`\hat a_i`; < 0.30 for :math:`\hat b_i`
     - 0.30
     - Chain shape :math:`(n_w \times 160,\; 2n_s + 1)`;
       :math:`n_{\rm dim} = 9`.  Positive posterior variances confirmed.

Model selection and diagnostics
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* **LRT** (:math:`H_0`: additive, :math:`H_1`: combined) — the additive null
  is rejected at the 5 % level because :math:`b_0, b_2, b_3 \neq 0`.
  The test statistic :math:`\lambda_{\rm LR} = 2[\ln\mathcal{L}_1 - \ln\mathcal{L}_0]`
  is positive and :math:`p < 0.05`.

* **Null test** — the maximum Pearson correlation :math:`\max_i |r_i|` between
  the OLS-corrected weights and the template maps satisfies
  :math:`\max|r_i| < 0.50`, confirming partial residual removal.

* **SNR ranking** — the SNR array has shape :math:`(4,)`, all entries are
  :math:`\geq 0`, and at least one entry is :math:`> 0.01`, demonstrating
  that the real GAIA and LS10 maps carry detectable systematic signal.

Running the real-template tests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Ensure the FITS files are present at their default paths (see
:func:`~sys_mapping.maps.load_real_templates`), then::

   conda activate sys_map
   pytest tests/test_real_templates.py -v

Expected output::

   28 passed in ~62 s

To skip when files are absent, they degrade gracefully::

   28 skipped (reason: GAIA/LS10 FITS files not found)

----

.. rubric:: Test outcome

The full test suite was executed in the ``sys_map`` conda environment
(Python 3.11, JAX 64-bit, scikit-learn ≥ 1.3, real GAIA and LS10 FITS files
present) against the current codebase.

**Results:**

* **258 passed, 0 failed, 0 errors** (excluding ``test_timing.py``)
* ``test_real_templates.py`` — **28 passed** using real GAIA DR3 and LS10 DR10
  systematic maps; all six methods completed without error on the 5 954-pixel
  LS10 footprint.
* ``test_regression.py`` — **26 passed**, including the previously known edge
  case (``TestElasticNet::test_weights_bounded_positive``), which now passes.
* The systematic test matrix (``scripts/run_systematic_tests.py``, 32
  configurations) ran to completion; results are documented in
  :doc:`results_systematic_tests`.

**Status: PASSED** — all 258 tests pass with no known failures.