# Shield v1.4.0 — Evidence workspace + AI classification + dedup

Shipped: 2026-06-15

## What's new since v1.3.0

### 1. Evidence workspace (Pro+Agency only)

The operator can now open a finding in a dedicated viewer that
shows the **actual leaked page** in an iframe (no download, no
storage). The viewer adds:

- **Save .mhtml** — server-side MHTML build with all images embedded
  as base64. Single self-contained file. Opens in Chrome / Edge /
  IE11 / any modern mail client. The HTML is **never written to
  disk** — bytes are streamed to the browser, nothing persists.
- **Classify with AI** — manual button that calls vision-AI provider-M2.7 and
  returns one of 6 categories: `unauthorized_video` /
  `unauthorized_image` / `unauthorized_other` / `name_only` /
  `not_ncii` / `unknown`, with confidence + a quoted evidence
  phrase from the page.
- **Search across findings** — one search box on the case scans
  title + snippet + url + classification evidence across every
  finding on the case.
- **In-page search** — Ctrl+F-style overlay that operates on the
  page content shown in the iframe.
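
As a rough illustration of the "Save .mhtml" path, the sketch below assembles a `multipart/related` document in memory with every part base64-encoded. The names `buildMhtml`, `mimePart`, and `wrap76` and the `images` shape are assumptions for this example; the real `services/mhtmlBuilder.js` may be organised differently.

```javascript
// Sketch of a self-contained MHTML (multipart/related) builder.
// Function and field names are illustrative, not the real mhtmlBuilder API.
const CRLF = '\r\n';

// MIME bodies wrap base64 at 76 characters per line.
function wrap76(b64) {
  return (b64.match(/.{1,76}/g) || []).join(CRLF);
}

// One MIME part: boundary line, headers, blank line, base64 body.
function mimePart(boundary, type, location, buf) {
  return [
    `--${boundary}`,
    `Content-Type: ${type}`,
    'Content-Transfer-Encoding: base64',
    `Content-Location: ${location}`,
    '',
    wrap76(buf.toString('base64')),
    '',
  ].join(CRLF);
}

// html: string; images: [{ url, type, data: Buffer }], already fetched by the caller.
function buildMhtml(pageUrl, html, images = []) {
  const boundary = `----=_ShieldPart_${Date.now().toString(16)}`;
  const header = [
    'MIME-Version: 1.0',
    `Content-Type: multipart/related; type="text/html"; boundary="${boundary}"`,
    '',
    '',
  ].join(CRLF);
  const parts = [
    mimePart(boundary, 'text/html; charset="utf-8"', pageUrl, Buffer.from(html, 'utf8')),
    ...images.map((img) => mimePart(boundary, img.type, img.url, img.data)),
  ];
  // Returned as a Buffer so a route can send it without writing to disk.
  return Buffer.from(header + parts.join('') + `--${boundary}--${CRLF}`, 'utf8');
}
```

Because the result is a single in-memory `Buffer`, a route handler can write it straight to the response, which matches the "never written to disk" behaviour described above.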

### 2. AI auto-classification (Pro+Agency only)

When a deep-discovery scan confirms a tier-1 or tier-2 finding
(NCII keyword in title + name co-occurrence in body), the worker
schedules a fire-and-forget `setImmediate` callback that calls the
same classifier. The operator sees the category badge populate a few
seconds after the row appears. The case owner's plan must be Pro or
Agency.
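
The fire-and-forget pattern can be sketched as below. This is a minimal illustration, not the worker's actual code: `scheduleAutoClassify`, the `tier` field, the plan identifiers `'pro'` / `'agency'`, and the `onDone` callback are all assumptions.

```javascript
// Assumed plan identifiers; the real values may differ.
const CLASSIFY_PLANS = new Set(['pro', 'agency']);

// Returns true when a classification was scheduled, false when skipped.
function scheduleAutoClassify(finding, ownerPlan, classify, onDone = () => {}) {
  if (!CLASSIFY_PLANS.has(ownerPlan)) return false;
  if (finding.tier !== 1 && finding.tier !== 2) return false;

  setImmediate(() => {
    // Promise.resolve().then(...) also captures a synchronous throw from classify().
    Promise.resolve()
      .then(() => classify(finding))
      .then((result) => onDone(null, result), (err) => onDone(err))
      .catch(() => {}); // a failing callback must not crash the worker
  });
  return true;
}
```

The caller never awaits the classifier, so a slow or failing AI call cannot delay or fail the scan; the error only reaches `onDone`.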

### 3. Real AI provider wired (vision-AI provider)

- Endpoint: `https://api.minimax.io/v1/text/chatcompletion_v2`
  (full path — do NOT append `/chat/completions`)
- Model: `vision-AI provider-M2.7` (1M context, frontier multimodal)
- Reasoning tokens consume ~half of `max_tokens`, so the limit was raised to 800
- `M2.7`, `M2.7-highspeed`, `M3` cost estimates added to
  `services/ai.js::estimateCost`
- Real API key stored in the `api_keys` table (id=11, label
  `live:vision-AI provider`, quota 5000/day)
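
To make the endpoint note concrete, here is a hedged sketch of building the request. The URL and the 800-token limit come from the notes above; the OpenAI-style body (`model`, `messages`, `max_tokens`), the Bearer header, and the name `buildClassifyRequest` are assumptions, and the model name is passed in rather than hard-coded.

```javascript
// Full path from the release notes; nothing is appended to it.
const AI_ENDPOINT = 'https://api.minimax.io/v1/text/chatcompletion_v2';

// Builds { url, options } suitable for fetch(url, options).
// The body shape and auth header are assumptions for illustration.
function buildClassifyRequest(apiKey, pageText, model) {
  return {
    url: AI_ENDPOINT,
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model,
        max_tokens: 800, // roughly half is consumed by reasoning tokens
        messages: [
          { role: 'system', content: 'Classify the page into one of the 6 categories.' },
          { role: 'user', content: pageText },
        ],
      }),
    },
  };
}
```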

### 4. Dedup (no double emailing)

Two model-layer dedup invariants:

- **Findings**: `(case_id, url)` is unique. A second scrape source
  rediscovering the same URL returns the existing row with
  `__deduped: true`. Enforced at the DB level by the new
  `uniq_findings_case_url` index.
- **Takedowns**: `(case_id, domain, contact_email)` is unique.
  Multiple scrape sources all surfacing the same registrar abuse
  address produce one takedown, not three. Chain-only takedowns
  (no contact found) dedup on `(case_id, domain)`.
- Different recipients on the same domain ARE allowed — you want
  one platform contact + one registrar contact for the same case.
- `case_discovery_scans.deduped` column tracks per-scan dedup
  events so the operator can see "I skipped 12 dupes this run."
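
The two invariants can be sketched with an in-memory stand-in for the models. `FindingStore` and `takedownKey` are hypothetical names, and lowercasing the domain and email before comparison is an assumption, not something the notes above state.

```javascript
// In-memory stand-in for models/finding.js: (caseId, url) is unique.
class FindingStore {
  constructor() {
    this.rows = new Map();
    this.nextId = 1;
  }

  findByCaseAndUrl(caseId, url) {
    return this.rows.get(`${caseId}\u0000${url}`) || null;
  }

  create({ caseId, url, title }) {
    const existing = this.findByCaseAndUrl(caseId, url);
    // A rediscovered URL returns the existing row, flagged as deduped.
    if (existing) return { ...existing, __deduped: true };
    const row = { id: this.nextId++, caseId, url, title };
    this.rows.set(`${caseId}\u0000${url}`, row);
    return row;
  }
}

// Takedown identity: (caseId, domain, contactEmail); a missing contact
// collapses to the chain-only (caseId, domain) key.
function takedownKey({ caseId, domain, contactEmail }) {
  const contact = contactEmail ? contactEmail.trim().toLowerCase() : '';
  return [caseId, domain.toLowerCase(), contact].join('|');
}
```

Two takedowns for the same case and domain produce different keys when their recipients differ, so a platform contact and a registrar contact can coexist, while a chain-only takedown never collides with one that has a real contact.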

Backfill script `scripts/backfill-dedup-takedowns.js` removed 18
pre-existing duplicate rows (kept the smallest id, deleted the
rest, with a `.pre-dedup-backup-*.sqlite` safety copy).
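
The "keep the smallest id" rule used by the backfill can be sketched as a pure planning step. `planDedup` and its row shape are hypothetical; the real script also takes the backup copy and issues the deletes, which this sketch leaves out.

```javascript
// Given rows and a key function, decide which ids to keep and which to delete.
// The lowest id per key survives; every later duplicate is marked for removal.
function planDedup(rows, keyOf) {
  const keep = new Map();
  const remove = [];
  const byId = [...rows].sort((a, b) => a.id - b.id);
  for (const row of byId) {
    const key = keyOf(row);
    if (keep.has(key)) remove.push(row.id);
    else keep.set(key, row.id);
  }
  return { keep: [...keep.values()], remove };
}
```

Separating the plan from the delete makes a dry run easy to review before anything is removed.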

### 5. Platform-first contact discovery — extended patterns

- Cloudflare `__cf_email__` hex decode (most adult sites use it)
- Trailing-slash fix — `/contacts` returns 404, `/contacts/`
  returns 200 on most WordPress permalink setups; `PRIORITY_PATHS`
  now tries both
- **NEW: `gen_mail('lhs', 'rhs')` JS-obfuscation** — parses the
  older PHP `gen_mail()` pattern that EroMe and many other sites
  use. Next time EroMe-style sites show up the system auto-extracts
  `contact@host` without manual intervention.
- Cloudflare-fronted fallback — `abuse@cloudflare.com` added to
  the escalation chain when the platform is behind Cloudflare
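
The three extraction patterns above can be sketched as small pure functions. The Cloudflare decode follows the well-known scheme in which the first byte is an XOR key for the remaining bytes. The `gen_mail` regex and the assumption that `gen_mail('lhs', 'rhs')` renders as `lhs@rhs` are illustrative, as are the function names.

```javascript
// Decode a Cloudflare-obfuscated address (the hex value of data-cfemail):
// byte 0 is the XOR key, the remaining bytes are the masked characters.
function decodeCfEmail(hex) {
  const key = parseInt(hex.slice(0, 2), 16);
  let out = '';
  for (let i = 2; i < hex.length; i += 2) {
    out += String.fromCharCode(parseInt(hex.slice(i, i + 2), 16) ^ key);
  }
  return out;
}

// Assumption: gen_mail('lhs', 'rhs') is rendered as lhs@rhs by the page script.
const GEN_MAIL_RE = /gen_mail\(\s*['"]([^'"]+)['"]\s*,\s*['"]([^'"]+)['"]\s*\)/g;

function extractGenMail(html) {
  const found = new Set(); // the Set drops repeated addresses
  for (const m of html.matchAll(GEN_MAIL_RE)) {
    const candidate = `${m[1]}@${m[2]}`.toLowerCase();
    // Basic shape check so malformed output is never used as a recipient.
    if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(candidate)) found.add(candidate);
  }
  return [...found];
}

// Try every priority path both without and with a trailing slash.
function withSlashVariants(paths) {
  return paths.flatMap((p) => {
    const bare = p.replace(/\/+$/, '');
    return [bare, `${bare}/`];
  });
}
```

For example, `decodeCfEmail('422302206c212d')` yields `a@b.co`, and `withSlashVariants(['/contacts'])` yields both `/contacts` and `/contacts/`.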

### 6. Auto-learner + escalation chain improvements

- `backfill-escalation-chain.js` — one-shot script that extracts
  the escalation chain from existing email bodies and writes it
  into `escalation_json`. Fixes the case where production_pending
  rows from a manual build were missing the chain metadata, so
  the 30h auto-escalator had nothing to follow.
- 5 follow-up emails to `abuse@cloudflare.com` already
  auto-delivered (one per original recipient that didn't respond in 30h)
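
A minimal sketch of the 30h escalation decision follows. Only the `escalation_json` column and the 30h window come from the notes above; the other field names (`sent_at`, `responded_at`, `escalation_index`) and the assumption that the JSON holds an ordered array of addresses are hypothetical.

```javascript
const ESCALATION_DELAY_MS = 30 * 60 * 60 * 1000; // 30 hours

// Returns the next address to escalate to, or null when nothing is due.
// Row shape is assumed; timestamps are epoch milliseconds.
function nextEscalation(row, now = Date.now()) {
  if (row.responded_at) return null;                        // recipient answered
  if (now - row.sent_at < ESCALATION_DELAY_MS) return null; // still inside the window
  const chain = JSON.parse(row.escalation_json || '[]');
  // Without chain metadata there is nothing to follow.
  return chain[row.escalation_index || 0] || null;
}
```

A row whose `escalation_json` is empty always returns `null`, which is the failure mode the backfill script addresses.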

### 7. Bug fixes (since v1.3.0)

- **`scanId` TDZ** in `services/deepDiscovery.js:245` — the 7-day
  query-reuse code referenced `scanId` before it was declared. Fixed
  by using `preScanId` (the parameter) instead.
- **`platformContactsRepo` undefined** in `services/deepDiscovery.js` —
  the deep-discovery worker called `platformContactsRepo.recordDiscovery`
  in 3 places but the import was missing. Added.
- **MIME type + `replyTo`** in `services/email.js`: `sendRaw` now
  honors `replyTo` and sets the proper `multipart/related` content
  type for MHTML attachments.
- **Dashboard URL truncation** — `dashboard.js::fmtUrl` now shows
  full URL (domain + path, no protocol) with `title` attr for hover
  tooltip; falls back to middle-ellipsis at 70 chars.
- **API call counts** that account for the new
  `case_discovery_scans.deduped` column are now displayed correctly
  in the scan progress UI.
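
Two of the fixes above lend themselves to a short illustration. The first pair of functions reproduces the temporal-dead-zone error and the parameter-based fix in isolation; the second is a sketch of a URL formatter with a middle ellipsis at 70 characters. The function bodies are assumptions written for this example, not the code in `services/deepDiscovery.js` or `dashboard.js`.

```javascript
// 1. Temporal dead zone: a const/let binding cannot be read before its declaration runs.
function reuseKeyBuggy(preScanId) {
  const reuseKey = `scan:${scanId}`; // throws ReferenceError: scanId is in the TDZ
  const scanId = preScanId;
  return { reuseKey, scanId };
}

function reuseKeyFixed(preScanId) {
  const reuseKey = `scan:${preScanId}`; // the parameter is already initialized
  const scanId = preScanId;
  return { reuseKey, scanId };
}

// 2. Show domain + path without the protocol; shorten long values in the middle.
function fmtUrl(raw, max = 70) {
  const text = raw.replace(/^https?:\/\//i, '');
  if (text.length <= max) return text;
  const keep = max - 1; // one slot is reserved for the ellipsis character
  const head = Math.ceil(keep / 2);
  const tail = Math.floor(keep / 2);
  return `${text.slice(0, head)}…${text.slice(text.length - tail)}`;
}
```

Truncating in the middle keeps both the domain and the end of the path visible, which is usually where the identifying part of a URL sits.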

### 8. Operator UX

- **Manual classify button** on every finding card + the in-page
  viewer (Pro+Agency)
- **MHTML download button** on every finding card (Pro+Agency)
- **"Save .mhtml"** is reached from the finding viewer, not from the
  dashboard's recent-takedowns widget; the MHTML is downloaded
  directly via `Content-Disposition: attachment`
- **Cross-findings search bar** on the case page

## Files changed / added (since v1.3.0)

```
.env                          ← real vision-AI provider + endpoint (replace before deploy)
app.js                        ← /findings route mounted
config/db.js                  ← finding_classifications table + deduped column
config/assertProductionEnv.js
middleware/auth.js            ← requirePlan() helper
models/finding.js             ← findByCaseAndUrl + dedup in create()
models/findingClassification.js (NEW)
models/takedown.js            ← findByCaseDomainContact + dedup in create()
routes/findings.js            (NEW) /findings/:id/{page,save.mhtml,view,classify,classification}
routes/cases.js               ← GET /:id/findings
services/ai.js                ← categorize() + vision-AI provider-M2.7 endpoint
services/contactDiscovery.js  ← gen_mail parser + /contacts/ + Cloudflare fallback
services/deepDiscovery.js     ← auto-classify on tier-1/tier-2 + dedup integration
services/email.js             ← replyTo + multipart/related
services/mhtmlBuilder.js      (NEW) self-contained MHTML builder
public/views/finding-viewer.html      (NEW)
public/views/findings-search.html     (NEW)
public/views/case.html        ← findings + evidence panel
public/js/dashboard.js        ← URL formatter (no truncation)
public/css/main.css           ← .url-cell style
scripts/takedown-erome.js              (NEW)
scripts/backfill-dedup-takedowns.js   (NEW)
scripts/backfill-escalation-chain.js  (NEW)
test/dedup.test.js                     (NEW) — 6 tests, all green
```

## Deploying to cPanel

1. Upload `ncii-shield-1.4.0.tar.gz` to the file manager.
2. Extract into the app root (e.g. `~/ncii-shield`).
3. Copy `.env.example` → `.env`. Set:
   - `PUBLIC_BASE_URL=https://takedowns.vesamuni.com`
   - `SESSION_SECRET=<64+ random chars>`
   - `TRUST_PROXY=1`
   - `RESEND_FROM_EMAIL=noreply@takedowns.vesamuni.com`
   - `RESEND_FROM_NAME=Shield`
   - `RESEND_ALERT_TO=<your alert email>`
   - `NODE_ENV=production`
   - **Migrate API keys via `/admin → API keys`** (transactional email, search proxy,
     Stripe). The vision-AI provider key is already in the api_keys table — no
     action needed; just don't re-add it with the wrong URL.
4. In cPanel "Setup Node.js App":
   - Node version: 24.x
   - Application root: the extracted dir
   - Application startup file: `app.js`
   - Passenger log file: `data/server.log`
5. Run `npm ci --omit=dev` in the app directory (preferred on cPanel
   Passenger; it requires a `package-lock.json`). Otherwise run
   `npm install --omit=dev --no-audit --no-fund`.
6. Click "Restart" in the Node.js App panel.
7. Visit `/health` — expect 200 if transactional email + search proxy + vision-AI provider keys
   are all configured.
8. Visit `/admin` — first-run will require a 2FA enrollment before
   the admin panel is accessible.
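
Before step 6, the `.env` values from step 3 can be sanity-checked with a small script. This is a sketch in the spirit of `config/assertProductionEnv.js`, not its actual contents; `checkProductionEnv` and the exact rules are assumptions derived from the settings listed above.

```javascript
// Returns a list of human-readable problems; an empty list means the env looks sane.
function checkProductionEnv(env) {
  const problems = [];
  const required = [
    'PUBLIC_BASE_URL',
    'SESSION_SECRET',
    'TRUST_PROXY',
    'RESEND_FROM_EMAIL',
    'RESEND_FROM_NAME',
    'RESEND_ALERT_TO',
  ];
  for (const name of required) {
    if (!env[name]) problems.push(`${name} is not set`);
  }
  if (env.SESSION_SECRET && env.SESSION_SECRET.length < 64) {
    problems.push('SESSION_SECRET must be at least 64 characters');
  }
  if (env.PUBLIC_BASE_URL && !env.PUBLIC_BASE_URL.startsWith('https://')) {
    problems.push('PUBLIC_BASE_URL must use https');
  }
  if (env.NODE_ENV !== 'production') {
    problems.push('NODE_ENV must be "production"');
  }
  return problems;
}
```

Running `checkProductionEnv(process.env)` after loading `.env` and refusing to start on a non-empty result catches a short session secret or a missing sender address before the first request.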

## Migration / data preservation

- **The sqlite db is NOT included in the tarball.** It's runtime
  state, lives in `data/ncii.sqlite` + `data/sessions.sqlite`. On
  cPanel the operator should:
  - If upgrading: copy the existing `data/` dir to the new
    deploy before first start. Schema migrations run idempotently
    on boot.
  - If fresh: `node seed-admin.js` to seed the superadmin, then
    login + add transactional email / search proxy / Stripe keys.

## Tests

41 tests, 0 failures. New this release: 6 dedup tests
in `test/dedup.test.js` (Finding dedup, Finding different-URL,
Takedown dedup with contact, Takedown dedup different recipients,
chain-only dedup, null-vs-real-contact distinction).

```
$ node test/run.js
…
  41 passed · 0 failed
```

## Risk notes

- **Auto-classify** is fire-and-forget from `setImmediate` — if the
  AI service is down the deep-discovery scan still completes. The
  operator can manually classify any un-categorized finding via the
  "Classify" button.
- **MHTML build** may fail on Cloudflare-fronted sites that block
  all non-browser User-Agents. In that case the page still loads
  in the iframe (live fetch) but the MHTML download returns 502
  "MHTML_BUILD_FAILED" — the operator should use the in-iframe
  "Save page as" fallback.
- **gen_mail parser** is heuristic: false positives are possible
  if an unrelated JS function with the same name happens to be
  defined. The parser only adds candidates to a Set, and the dedup
  layer + valid-email filter prevent any malformed output from being
  used as a takedown recipient.

