# Shield — Improvement & Security Backlog > Honest, ranked list of what would make the system better, organised by > severity / effort. Not a roadmap — just everything I noticed while > sweeping v1.2.0 that isn't yet shipped. Pick the ones that hurt. ## Legend | Symbol | Meaning | |--------|---------| | 🔴 | Real bug or security gap. Fix before next deploy. | | 🟠 | Material risk or UX papercut. Should fix in next sprint. | | 🟡 | Polish / efficiency. Fix when there's time. | | ⚪ | Nice-to-have. Park it. | | Effort | S = < 1 hour · M = 1–4 hours · L = 1–2 days · XL = 1+ week | --- ## 🔴 P0 — Fix before next deploy ### 1. `APP_URL` and `PUBLIC_BASE_URL` are two names for the same thing **File:** `services/email.js` (uses `APP_URL`), `README.md` (tells ops to set `PUBLIC_BASE_URL`). **Issue:** `email.sendVerification` / `sendPasswordReset` build `${APP_URL}/auth/verify?token=...` — but `APP_URL` isn't in any actual `.env` example and the deploy docs say `PUBLIC_BASE_URL`. If neither is set, the URLs in verification/reset emails are `undefined/...` and click straight to a 404. Registration currently relies on operators noticing and fixing this. **Fix:** Pick one name, put it in `.env.example`, have both code paths read it. Add an assertion at startup that it's set in `NODE_ENV=production`. **Effort:** S ### 2. `autoEscalate` predicate is wrong when `sent_at` is NULL **File:** `jobs/autoEscalate.js:108` — `AND (escalated_at IS NULL OR escalated_at < COALESCE(sent_at, 0))`. **Issue:** `COALESCE(sent_at, 0)` returns 0 when `sent_at` is NULL. Then `escalated_at < 0` is always false. So **any takedown that was never marked sent (e.g. the operator used a manual chain-only path, or the row pre-dates the `sent_at` column) will never auto-escalate.** Silent failure. **Fix:** Replace with `AND (escalated_at IS NULL OR escalated_at < IFNULL(sent_at, next_escalate_at - 1))`, or simpler: split the predicate into "no escalation yet" vs "last escalation before last send" and add a unit test. **Effort:** S ### 3. `scanLimiter` and `apiLimiter` are defined but only `apiLimiter` is wired **File:** `middleware/rateLimit.js:32` (defined), never imported; `routes/scan.js` has no limiter. **Issue:** `/scan/run`, `/scan/image`, `/scan/video` are gated only by `planGate('scan')` — a logged-in free-tier user can hit them 100×/sec, each costing search proxy / vision-AI provider / Sharp CPU. Cheap DOS of the upstream. **Fix:** `app.use('/scan', scanLimiter)` in `app.js` (after auth) or `router.use(scanLimiter)` in `routes/scan.js`. Also wire it to `/cases/:id/discover` — that's the most expensive endpoint now. **Effort:** S ### 4. `/scan/video` has no top-level try/catch **File:** `routes/scan.js:200`. **Issue:** If `hash.phashFrames` throws (network timeout, malformed base64), the user gets a 500 with the unhandled error in the body, no `ref`, no alert, no audit. Also doesn't go through the `userErrors` envelope. **Fix:** Wrap in try/catch, return `{error: 'INTERNAL', ref, message: 'Something went wrong, try again or contact support with this ref.'}`, raise an alert. **Effort:** S ### 5. `email` body for transactional email is being sent as `text` but the API expects `text` OR `html`, not both as plaintext **File:** `services/email.js:25-29`. **Issue:** The body is sent as `text: ` AND `html: <pre>${escaped}</pre>`. That's actually fine. **However:** if a takedown letter contains `$1`, `$$`, or other shell-like sequences that get passed through transactional email's optional Markdown parser, they may render incorrectly in some clients. Minor — only matters for visual fidelity. **Effort:** S ### 6. CSP `unsafe-inline` on scripts **File:** `app.js:55-58` — `script-src 'self' 'unsafe-inline' ...`. **Issue:** We allow inline `<script>` blocks. The HTML pages have inline `<script>` blocks (every `*.html` in `public/views/` does `csrf()` + `api()` + inline event handlers). Removing `unsafe-inline` requires either nonce-based CSP or moving every inline script into a static file. Real fix is the latter. **Fix:** Extract every inline `<script>` block from `public/views/*.html` into `public/js/*.js` files; add `nonce-` middleware that issues a fresh nonce per request and rewrites the CSP header. (M = a few hours of refactor; L = full sweep with verification.) **Effort:** M (initial) / L (full) --- ## 🟠 P1 — Next sprint ### 7. Admin gate (2FA) has no client-side nudge **File:** `public/js/admin.js`, `public/views/admin-dashboard.html`. **Issue:** When a superadmin first logs in, every admin endpoint returns 401 with `error: 'TWO_FA_REQUIRED'`. The dashboard JS just sees a JSON 401 and shows an empty panel. There's no in-page flow to enroll from the dashboard — you have to know to `POST /admin/2fa/start` first. **Fix:** Add a banner on the admin dashboard that detects the 401, calls `/admin/2fa/status`, and shows a "Set up 2FA" wizard with QR code rendering (use a small QR lib like `qrcode`). **Effort:** M ### 8. No CSRF exemption for Stripe webhooks / transactional email inbound **File:** `middleware/csrf.js` (wherever it lives). **Issue:** Webhook route is registered before json parser, which is good. But if anyone adds a new webhook in future (e.g. transactional email inbound), they may forget. Worth a comment in `app.js` near `app.use('/webhook/stripe', ...)`. **Effort:** S ### 9. `userErrors` envelope is inconsistently used **Files:** Most routes use `{error: 'CODE'}`. `middleware/userErrors.js` defines `{error, ref, message}` with a stable ref. Some routes use the envelope, most don't. **Issue:** Operators see the `ref` in user-facing errors and grep `/admin/alerts?ref=...` to find detail. If the route never set the ref, the operator hits a dead end. The `ref` mechanism is only as useful as it is consistent. **Fix:** Wrap every error-returning route handler with `asyncRoute` (already exists). Standardise on `{error, ref, message}` everywhere. **Effort:** M (sweep) / S (one route at a time) ### 10. No automated tests **File:** Entire `tests/` directory doesn't exist. **Issue:** Every fix in v1.2.0 was verified by hand. One regex typo and we don't know until prod. **Fix:** Start with a `test/` folder and add: - `test/auth.test.js` — login lockout, 2FA gate, session save. - `test/apiKey.test.js` — prefix validation, per-service shape. - `test/discovery.test.js` — uses `NCII_DEEP_DISCOVERY_TEST_MODE=1` to assert 3 validated / 3 production_pending. - `test/autoEscalate.test.js` — predicate correctness, especially the `sent_at IS NULL` case from #2. - `test/migration.test.js` — runs migrations on a fresh sqlite file and asserts schema. **Effort:** L (initial) / ongoing ### 11. `case_discovery_scans` cleanup **File:** `config/db.js`. **Issue:** The table grows forever. Old scans (and their evidence JSONs in `data/evidence/`) accumulate. A 1000-case deployment with 10 scans/case = 10,000 rows + 10,000 JSON files. **Fix:** Add a cron in `jobs/dailyReset.js` (or a new `jobs/retention.js`) that prunes `case_discovery_scans` rows older than 90 days AND their matching evidence files, with a `NCII_EVIDENCE_RETENTION_DAYS=90` env override. Keep an export to cold storage as a separate `archive_evidence(case_id, scan_id)` action. **Effort:** M ### 12. No healthcheck endpoint **File:** `app.js`. **Issue:** The deploy doc says "external monitor pings `/health` every minute" — but there is no `/health` route. The README lies. **Fix:** Add a tiny `GET /health` that returns `{ok: true, uptime, db_ok: <bool>, workers: {discovery: 'running', takedown: 'running', escalate: 'running'}}`. Don't require auth. Use it for both uptime monitoring AND a startup wait-for-ready check. **Effort:** S ### 13. No structured logging **File:** Everywhere. **Issue:** `console.log` and `console.warn` only. No level, no JSON, no request-id correlation. When an alert fires, you can find the cause in `data/server.log` only if you remember the timestamp. There IS a `data/server.log` in the project root which is excluded from the tarball — so the operator on the server has nothing to look at after a deploy. **Fix:** Use `pino` (or a 30-line `logger.js` shim) with JSON output, request-id middleware, log levels, redact `key_value` + `password` + `token`. Pipe to `/var/log/ncii-shield/app.log` and rotate. **Effort:** M --- ## 🟡 P2 — Polish ### 14. `keyRotator` error messages leak the key length **File:** `services/keyRotator.js`. **Issue:** When a key fails, the error message includes the key's last 4 chars. Useful for debugging, slightly leaky in logs. **Fix:** Keep for dev (`NODE_ENV=development`), mask for prod. **Effort:** S ### 15. `secrets/env` vs DB-only API keys **File:** `.env.example`, `services/email.js`. **Issue:** The deploy doc says "API keys live in `api_keys` table only; `.env` placeholders are not authoritative." But `.env.example` still mentions `STRIPE_SECRET_KEY` and `STRIPE_WEBHOOK_SECRET`, which are NOT in the `api_keys` table — they're `.env` only. So the rule has exceptions that aren't documented. **Fix:** Either move Stripe to `api_keys` (it's a multi-key-friendly API) or document the two-bucket rule explicitly with a list of which secrets live where. **Effort:** S ### 16. The victim_name variants list is hard-coded English / Sinhala **File:** `services/deepDiscovery.js`. **Issue:** If the operator has a Tamil / Hindi / Tagalog victim, the name-variant generator won't produce a useful query set. Same for keywords — only EN+SI are scanned. **Fix:** Make the keyword list configurable per case (or per org) via a `keyword_lists` table that's already in the schema. Same for name variant generation — make the user's `name` field accept a comma-separated list of spellings. **Effort:** M ### 17. No "duplicate scan" guard **File:** `routes/cases.js:128` (`POST /:id/discover`). **Issue:** Operator can spam-click "Run deep discovery" and queue 5 scans. They all run sequentially, burning quota. **Fix:** Refuse to enqueue if there's a `running` or `queued` scan for this case in the last 5 minutes; return 409 with the existing `scan_id`. **Effort:** S ### 18. Operator can accidentally send to an invalid email **File:** `routes/admin.js:323` (`POST /admin/production-pending/:id/send`). **Issue:** Only checks `td.contact_email` is truthy. Doesn't validate it's a real email. We send anyway. **Fix:** Add `validator.isEmail(td.contact_email)` check; refuse to send to `example.com`, `test.com`, the reserved TLDs, etc. **Effort:** S ### 19. Image / video scan doesn't deduplicate on hash **File:** `routes/scan.js`, `models/finding.js`. **Issue:** Two users scanning the same image create two `findings` rows. The `pHash` column exists but isn't used to find-existing-first. **Fix:** Before creating a `findings` row, look for an existing row with the same `phash` within the last 30 days and link the new scan to it. **Effort:** M ### 20. No rate limit on `/auth/2fa/*` or `/admin/*` **File:** `app.js`, `routes/twofa.js`, `routes/admin.js`. **Issue:** An attacker who knows the admin's password can brute-force the 6-digit TOTP at 1000 attempts/sec. **Fix:** Wrap `/auth/login/2fa` with a tighter limiter (e.g. 5/min per session, 20/min per IP). Same for `/2fa/verify` in case it's exposed. **Effort:** S --- ## ⚪ P3 — Nice-to-have ### 21. Push notifications (Web Push / FCM) Currently the operator only sees alerts when they open the dashboard. A "Scan finished — 3 production_pending ready" push would let them act faster. **Effort:** L ### 22. Multi-language templates Takedown letters are English-only. The legal sections cite 17 U.S.C. § 512, GDPR Art. 17, etc. — for a Sri Lankan case, the operator probably wants to also cite the Computer Emergency Readiness Team of Sri Lanka (Sri Lanka CERT|CC) and the local criminal code § 354A / § 399. **Effort:** L ### 23. Bulk import of candidate URLs A "paste 50 URLs and triage them" UI for operators who already have a list. Currently the only way in is via deep discovery. **Effort:** M ### 24. Public trust-and-safety API Allow vetted researchers / partner NGOs to submit URLs to Shield for takedown, via a separate `/public/v1/...` namespace with a partner API key. **Effort:** L (and brings compliance review) ### 25. SOC 2 / GDPR DPA paperwork For enterprise customers. Not a code change, but the deploy doc should mention it. **Effort:** XL (mostly human) ### 26. Native mobile / desktop wrapper A Capacitor / Tauri wrapper around the operator dashboard so case triage can happen on the phone. **Effort:** XL ### 27. Better evidence bundle export Currently each scan writes one JSON. The operator may want a single ZIP per case with all scans + all takedown letters + all sent-status as PDF, for legal hold / law enforcement hand-off. **Effort:** M ### 28. Discord / Slack webhook on case milestones "Case opened", "First response received from registrar", "Content removed". Useful for org-mode operators. **Effort:** M ### 29. Per-org branding on takedown letters The platform `from` is hard-coded "Platform Admin <admin@yourdomain.com>". Agency / Standard plans would want their own brand on the letterhead. **Effort:** M (add an `org.letterhead_json` column, render it in `platformContacts.buildTemplate`) ### 30. Reverse image search adapter Currently the operator must already have the image. Add a TinEye / Google Vision / vision-AI provider vision flow that takes a description and finds candidate images. Useful when the operator has a URL but no image file. **Effort:** L --- ## What I'd ship in the next 2 weeks If I had 2 weeks of founder time, this is the order: 1. **#2, #3, #4** — all S effort, all real bugs. (Day 1) 2. **#1** — S effort, prevents a support nightmare. (Day 1) 3. **#12** — S effort, makes the deploy doc honest. (Day 1) 4. **#7** — M effort, removes the largest UX papercut (admin gets 401 and doesn't know why). (Day 2) 5. **#10** — start the test suite. Even 5 tests is better than zero. (Day 3-4) 6. **#11** — cron retention. Otherwise the system falls over at year 1. (Day 5) 7. **#17, #18, #20** — small S-effort hardening. (Day 5) 8. **#9** — wrap routes with `asyncRoute`, one route at a time during normal bug fixes. (Ongoing) The rest is park-it. #21-#30 are "after Series A" features.