# Shield — Improvement & Security Backlog

> Honest, ranked list of what would make the system better, organised by
> severity / effort. Not a roadmap — just everything I noticed while
> sweeping v1.2.0 that isn't yet shipped. Pick the ones that hurt.

## Legend

| Symbol | Meaning |
|--------|---------|
| 🔴    | Real bug or security gap. Fix before next deploy. |
| 🟠    | Material risk or UX papercut. Should fix in next sprint. |
| 🟡    | Polish / efficiency. Fix when there's time. |
| ⚪    | Nice-to-have. Park it. |
| Effort | S = < 1 hour · M = 1–4 hours · L = 1–2 days · XL = 1+ week |

---

## 🔴 P0 — Fix before next deploy

### 1. `APP_URL` and `PUBLIC_BASE_URL` are two names for the same thing
**File:** `services/email.js` (uses `APP_URL`), `README.md` (tells ops to set `PUBLIC_BASE_URL`).
**Issue:** `email.sendVerification` / `sendPasswordReset` build `${APP_URL}/auth/verify?token=...` — but `APP_URL` isn't in any actual `.env` example and the deploy docs say `PUBLIC_BASE_URL`. If neither is set, the URLs in verification/reset emails are `undefined/...` and click straight to a 404. Registration currently relies on operators noticing and fixing this.
**Fix:** Pick one name, put it in `.env.example`, have both code paths read it. Add an assertion at startup that it's set in `NODE_ENV=production`.
**Effort:** S

### 2. `autoEscalate` predicate is wrong when `sent_at` is NULL
**File:** `jobs/autoEscalate.js:108` — `AND (escalated_at IS NULL OR escalated_at < COALESCE(sent_at, 0))`.
**Issue:** `COALESCE(sent_at, 0)` returns 0 when `sent_at` is NULL. Then `escalated_at < 0` is always false. So **any takedown that was never marked sent (e.g. the operator used a manual chain-only path, or the row pre-dates the `sent_at` column) will never auto-escalate.** Silent failure.
**Fix:** Replace with `AND (escalated_at IS NULL OR escalated_at < IFNULL(sent_at, next_escalate_at - 1))`, or simpler: split the predicate into "no escalation yet" vs "last escalation before last send" and add a unit test.
**Effort:** S

### 3. `scanLimiter` and `apiLimiter` are defined but only `apiLimiter` is wired
**File:** `middleware/rateLimit.js:32` (defined), never imported; `routes/scan.js` has no limiter.
**Issue:** `/scan/run`, `/scan/image`, `/scan/video` are gated only by `planGate('scan')` — a logged-in free-tier user can hit them 100×/sec, each costing search proxy / vision-AI provider / Sharp CPU. Cheap DOS of the upstream.
**Fix:** `app.use('/scan', scanLimiter)` in `app.js` (after auth) or `router.use(scanLimiter)` in `routes/scan.js`. Also wire it to `/cases/:id/discover` — that's the most expensive endpoint now.
**Effort:** S

### 4. `/scan/video` has no top-level try/catch
**File:** `routes/scan.js:200`.
**Issue:** If `hash.phashFrames` throws (network timeout, malformed base64), the user gets a 500 with the unhandled error in the body, no `ref`, no alert, no audit. Also doesn't go through the `userErrors` envelope.
**Fix:** Wrap in try/catch, return `{error: 'INTERNAL', ref, message: 'Something went wrong, try again or contact support with this ref.'}`, raise an alert.
**Effort:** S

### 5. `email` body for transactional email is being sent as `text` but the API expects `text` OR `html`, not both as plaintext
**File:** `services/email.js:25-29`.
**Issue:** The body is sent as `text: <plaintext>` AND `html: <pre>${escaped}</pre>`. That's actually fine. **However:** if a takedown letter contains `$1`, `$$`, or other shell-like sequences that get passed through transactional email's optional Markdown parser, they may render incorrectly in some clients. Minor — only matters for visual fidelity.
**Effort:** S

### 6. CSP `unsafe-inline` on scripts
**File:** `app.js:55-58` — `script-src 'self' 'unsafe-inline' ...`.
**Issue:** We allow inline `<script>` blocks. The HTML pages have inline `<script>` blocks (every `*.html` in `public/views/` does `csrf()` + `api()` + inline event handlers). Removing `unsafe-inline` requires either nonce-based CSP or moving every inline script into a static file. Real fix is the latter.
**Fix:** Extract every inline `<script>` block from `public/views/*.html` into `public/js/*.js` files; add `nonce-` middleware that issues a fresh nonce per request and rewrites the CSP header. (M = a few hours of refactor; L = full sweep with verification.)
**Effort:** M (initial) / L (full)

---

## 🟠 P1 — Next sprint

### 7. Admin gate (2FA) has no client-side nudge
**File:** `public/js/admin.js`, `public/views/admin-dashboard.html`.
**Issue:** When a superadmin first logs in, every admin endpoint returns 401 with `error: 'TWO_FA_REQUIRED'`. The dashboard JS just sees a JSON 401 and shows an empty panel. There's no in-page flow to enroll from the dashboard — you have to know to `POST /admin/2fa/start` first.
**Fix:** Add a banner on the admin dashboard that detects the 401, calls `/admin/2fa/status`, and shows a "Set up 2FA" wizard with QR code rendering (use a small QR lib like `qrcode`).
**Effort:** M

### 8. No CSRF exemption for Stripe webhooks / transactional email inbound
**File:** `middleware/csrf.js` (wherever it lives).
**Issue:** Webhook route is registered before json parser, which is good. But if anyone adds a new webhook in future (e.g. transactional email inbound), they may forget. Worth a comment in `app.js` near `app.use('/webhook/stripe', ...)`.
**Effort:** S

### 9. `userErrors` envelope is inconsistently used
**Files:** Most routes use `{error: 'CODE'}`. `middleware/userErrors.js` defines `{error, ref, message}` with a stable ref. Some routes use the envelope, most don't.
**Issue:** Operators see the `ref` in user-facing errors and grep `/admin/alerts?ref=...` to find detail. If the route never set the ref, the operator hits a dead end. The `ref` mechanism is only as useful as it is consistent.
**Fix:** Wrap every error-returning route handler with `asyncRoute` (already exists). Standardise on `{error, ref, message}` everywhere.
**Effort:** M (sweep) / S (one route at a time)

### 10. No automated tests
**File:** Entire `tests/` directory doesn't exist.
**Issue:** Every fix in v1.2.0 was verified by hand. One regex typo and we don't know until prod.
**Fix:** Start with a `test/` folder and add:
- `test/auth.test.js` — login lockout, 2FA gate, session save.
- `test/apiKey.test.js` — prefix validation, per-service shape.
- `test/discovery.test.js` — uses `NCII_DEEP_DISCOVERY_TEST_MODE=1` to assert 3 validated / 3 production_pending.
- `test/autoEscalate.test.js` — predicate correctness, especially the `sent_at IS NULL` case from #2.
- `test/migration.test.js` — runs migrations on a fresh sqlite file and asserts schema.
**Effort:** L (initial) / ongoing

### 11. `case_discovery_scans` cleanup
**File:** `config/db.js`.
**Issue:** The table grows forever. Old scans (and their evidence JSONs in `data/evidence/`) accumulate. A 1000-case deployment with 10 scans/case = 10,000 rows + 10,000 JSON files.
**Fix:** Add a cron in `jobs/dailyReset.js` (or a new `jobs/retention.js`) that prunes `case_discovery_scans` rows older than 90 days AND their matching evidence files, with a `NCII_EVIDENCE_RETENTION_DAYS=90` env override. Keep an export to cold storage as a separate `archive_evidence(case_id, scan_id)` action.
**Effort:** M

### 12. No healthcheck endpoint
**File:** `app.js`.
**Issue:** The deploy doc says "external monitor pings `/health` every minute" — but there is no `/health` route. The README lies.
**Fix:** Add a tiny `GET /health` that returns `{ok: true, uptime, db_ok: <bool>, workers: {discovery: 'running', takedown: 'running', escalate: 'running'}}`. Don't require auth. Use it for both uptime monitoring AND a startup wait-for-ready check.
**Effort:** S

### 13. No structured logging
**File:** Everywhere.
**Issue:** `console.log` and `console.warn` only. No level, no JSON, no request-id correlation. When an alert fires, you can find the cause in `data/server.log` only if you remember the timestamp. There IS a `data/server.log` in the project root which is excluded from the tarball — so the operator on the server has nothing to look at after a deploy.
**Fix:** Use `pino` (or a 30-line `logger.js` shim) with JSON output, request-id middleware, log levels, redact `key_value` + `password` + `token`. Pipe to `/var/log/ncii-shield/app.log` and rotate.
**Effort:** M

---

## 🟡 P2 — Polish

### 14. `keyRotator` error messages leak the key length
**File:** `services/keyRotator.js`.
**Issue:** When a key fails, the error message includes the key's last 4 chars. Useful for debugging, slightly leaky in logs.
**Fix:** Keep for dev (`NODE_ENV=development`), mask for prod.
**Effort:** S

### 15. `secrets/env` vs DB-only API keys
**File:** `.env.example`, `services/email.js`.
**Issue:** The deploy doc says "API keys live in `api_keys` table only; `.env` placeholders are not authoritative." But `.env.example` still mentions `STRIPE_SECRET_KEY` and `STRIPE_WEBHOOK_SECRET`, which are NOT in the `api_keys` table — they're `.env` only. So the rule has exceptions that aren't documented.
**Fix:** Either move Stripe to `api_keys` (it's a multi-key-friendly API) or document the two-bucket rule explicitly with a list of which secrets live where.
**Effort:** S

### 16. The victim_name variants list is hard-coded English / Sinhala
**File:** `services/deepDiscovery.js`.
**Issue:** If the operator has a Tamil / Hindi / Tagalog victim, the name-variant generator won't produce a useful query set. Same for keywords — only EN+SI are scanned.
**Fix:** Make the keyword list configurable per case (or per org) via a `keyword_lists` table that's already in the schema. Same for name variant generation — make the user's `name` field accept a comma-separated list of spellings.
**Effort:** M

### 17. No "duplicate scan" guard
**File:** `routes/cases.js:128` (`POST /:id/discover`).
**Issue:** Operator can spam-click "Run deep discovery" and queue 5 scans. They all run sequentially, burning quota.
**Fix:** Refuse to enqueue if there's a `running` or `queued` scan for this case in the last 5 minutes; return 409 with the existing `scan_id`.
**Effort:** S

### 18. Operator can accidentally send to an invalid email
**File:** `routes/admin.js:323` (`POST /admin/production-pending/:id/send`).
**Issue:** Only checks `td.contact_email` is truthy. Doesn't validate it's a real email. We send anyway.
**Fix:** Add `validator.isEmail(td.contact_email)` check; refuse to send to `example.com`, `test.com`, the reserved TLDs, etc.
**Effort:** S

### 19. Image / video scan doesn't deduplicate on hash
**File:** `routes/scan.js`, `models/finding.js`.
**Issue:** Two users scanning the same image create two `findings` rows. The `pHash` column exists but isn't used to find-existing-first.
**Fix:** Before creating a `findings` row, look for an existing row with the same `phash` within the last 30 days and link the new scan to it.
**Effort:** M

### 20. No rate limit on `/auth/2fa/*` or `/admin/*`
**File:** `app.js`, `routes/twofa.js`, `routes/admin.js`.
**Issue:** An attacker who knows the admin's password can brute-force the 6-digit TOTP at 1000 attempts/sec.
**Fix:** Wrap `/auth/login/2fa` with a tighter limiter (e.g. 5/min per session, 20/min per IP). Same for `/2fa/verify` in case it's exposed.
**Effort:** S

---

## ⚪ P3 — Nice-to-have

### 21. Push notifications (Web Push / FCM)
Currently the operator only sees alerts when they open the dashboard. A "Scan finished — 3 production_pending ready" push would let them act faster.
**Effort:** L

### 22. Multi-language templates
Takedown letters are English-only. The legal sections cite 17 U.S.C. § 512, GDPR Art. 17, etc. — for a Sri Lankan case, the operator probably wants to also cite the Computer Emergency Readiness Team of Sri Lanka (Sri Lanka CERT|CC) and the local criminal code § 354A / § 399.
**Effort:** L

### 23. Bulk import of candidate URLs
A "paste 50 URLs and triage them" UI for operators who already have a list. Currently the only way in is via deep discovery.
**Effort:** M

### 24. Public trust-and-safety API
Allow vetted researchers / partner NGOs to submit URLs to Shield for takedown, via a separate `/public/v1/...` namespace with a partner API key.
**Effort:** L (and brings compliance review)

### 25. SOC 2 / GDPR DPA paperwork
For enterprise customers. Not a code change, but the deploy doc should mention it.
**Effort:** XL (mostly human)

### 26. Native mobile / desktop wrapper
A Capacitor / Tauri wrapper around the operator dashboard so case triage can happen on the phone.
**Effort:** XL

### 27. Better evidence bundle export
Currently each scan writes one JSON. The operator may want a single ZIP per case with all scans + all takedown letters + all sent-status as PDF, for legal hold / law enforcement hand-off.
**Effort:** M

### 28. Discord / Slack webhook on case milestones
"Case opened", "First response received from registrar", "Content removed". Useful for org-mode operators.
**Effort:** M

### 29. Per-org branding on takedown letters
The platform `from` is hard-coded "Platform Admin <admin@yourdomain.com>". Agency / Standard plans would want their own brand on the letterhead.
**Effort:** M (add an `org.letterhead_json` column, render it in `platformContacts.buildTemplate`)

### 30. Reverse image search adapter
Currently the operator must already have the image. Add a TinEye / Google Vision / vision-AI provider vision flow that takes a description and finds candidate images. Useful when the operator has a URL but no image file.
**Effort:** L

---

## What I'd ship in the next 2 weeks

If I had 2 weeks of founder time, this is the order:

1. **#2, #3, #4** — all S effort, all real bugs. (Day 1)
2. **#1** — S effort, prevents a support nightmare. (Day 1)
3. **#12** — S effort, makes the deploy doc honest. (Day 1)
4. **#7** — M effort, removes the largest UX papercut (admin gets 401 and doesn't know why). (Day 2)
5. **#10** — start the test suite. Even 5 tests is better than zero. (Day 3-4)
6. **#11** — cron retention. Otherwise the system falls over at year 1. (Day 5)
7. **#17, #18, #20** — small S-effort hardening. (Day 5)
8. **#9** — wrap routes with `asyncRoute`, one route at a time during normal bug fixes. (Ongoing)

The rest is park-it. #21-#30 are "after Series A" features.
