Use this guide when user code needs internet access, but you still need strict control over which destinations can be reached.

Diagram: Policy-enforced scraping flow

Start with filtered mode

Allow only approved hosts first, then run your scraping logic.
import { DockerIsol8 } from "@isol8/core";

const engine = new DockerIsol8({
  mode: "ephemeral",
  network: "filtered",
  networkFilter: {
    whitelist: [
      "^api\\.github\\.com$",
      "^en\\.wikipedia\\.org$",
    ],
    blacklist: ["^169\\.254\\."],
  },
  timeoutMs: 30000,
  memoryLimit: "512m",
});

await engine.start();
In filtered mode, blacklist rules take precedence over whitelist rules.
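The precedence rule can be pictured as a small decision function. This is a conceptual sketch of the documented behavior, not the library's internal code: a host is reachable only if it matches some whitelist pattern and no blacklist pattern.

```typescript
// Conceptual sketch of filtered-mode precedence (not the library's implementation):
// blacklist is checked first, so a host matching both lists is blocked.
function isAllowed(host: string, whitelist: string[], blacklist: string[]): boolean {
  if (blacklist.some((p) => new RegExp(p).test(host))) return false; // blacklist wins
  return whitelist.some((p) => new RegExp(p).test(host));
}

console.log(isAllowed("api.github.com", ["^api\\.github\\.com$"], ["^169\\.254\\."])); // true
console.log(isAllowed("169.254.169.254", ["^169\\."], ["^169\\.254\\."])); // false, despite the whitelist match
```

The second call shows why link-local metadata addresses stay blocked even under a permissive whitelist.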

Pattern 1: approved API fetch

const result = await engine.execute({
  runtime: "python",
  code: `
import urllib.request, json

url = "https://api.github.com/repos/Illusion47586/isol8"
resp = urllib.request.urlopen(url)
data = json.loads(resp.read())
print(json.dumps({
  "repo": data["full_name"],
  "stars": data["stargazers_count"]
}))
`,
});

console.log(result.stdout);

Pattern 2: graceful handling for blocked hosts

const result = await engine.execute({
  runtime: "python",
  code: `
import urllib.request

targets = [
  "https://api.github.com",
  "https://example-blocked-domain.invalid"
]

for url in targets:
  try:
    urllib.request.urlopen(url, timeout=5)
    print(f"ALLOW {url}")
  except Exception as e:
    print(f"BLOCK {url}: {e}")
`,
});

Pattern 3: scraping HTML with packages

For richer parsing, install parser libraries:
const result = await engine.execute({
  runtime: "python",
  installPackages: ["requests", "beautifulsoup4"],
  code: `
import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/Docker_(software)", timeout=10).text
soup = BeautifulSoup(html, "html.parser")
first_p = soup.select_one(".mw-parser-output > p:not(.mw-empty-elt)")
print(first_p.get_text(strip=True)[:300])
`,
});

Authenticated API calls with secrets

When scraping private APIs, inject credentials through the secrets option rather than hard-coding them.
const secured = new DockerIsol8({
  mode: "ephemeral",
  network: "filtered",
  networkFilter: {
    whitelist: ["^api\\.example\\.com$"],
    blacklist: [],
  },
  secrets: {
    API_TOKEN: process.env.API_TOKEN!,
  },
});

const result = await secured.execute({
  runtime: "python",
  code: `
import os, urllib.request, json

req = urllib.request.Request(
  "https://api.example.com/data",
  headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"}
)
resp = urllib.request.urlopen(req)
print(resp.status)
`,
});
Secret masking applies to stdout/stderr text. If a script writes secrets to files, those file contents are not automatically redacted.
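The limitation follows from how this kind of masking works. Here is a hypothetical illustration of the concept (not the library's actual redaction code): masking is a literal string replacement over captured stdout/stderr, so secret values that leave the sandbox by any other channel are untouched.

```typescript
// Hypothetical sketch of output masking: replace every literal occurrence
// of each secret value in the captured text with a placeholder.
function maskSecrets(text: string, secrets: Record<string, string>): string {
  let out = text;
  for (const value of Object.values(secrets)) {
    out = out.split(value).join("****"); // literal replacement of every occurrence
  }
  return out;
}

const secrets = { API_TOKEN: "tok-12345" };
console.log(maskSecrets("Bearer tok-12345 accepted", secrets)); // "Bearer **** accepted"
```

Because only the captured text stream passes through this step, a token written to a file inside the container never reaches the redaction pass.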

Observe network behavior during scraping

Enable network request logs for filtered runs:
isol8 run scraper.py \
  --net filtered \
  --allow "^api\.github\.com$" \
  --log-network \
  --no-stream
In non-stream mode, the CLI prints collected network log entries when available.

Remote scraping workers

For centralized scraping infrastructure, run the remote server and connect with RemoteIsol8.
import { RemoteIsol8 } from "@isol8/core";

const remote = new RemoteIsol8(
  {
    host: "http://localhost:3000",
    apiKey: process.env.ISOL8_API_KEY!,
    sessionId: "scrape-job-001",
  },
  {
    network: "filtered",
    networkFilter: {
      whitelist: ["^api\\.github\\.com$"],
      blacklist: [],
    },
    timeoutMs: 30000,
  }
);

await remote.start();
const res = await remote.execute({
  runtime: "python",
  code: "print('remote scrape run')",
});
await remote.stop();

Safer scraping design patterns

  • whitelist exact hostnames instead of broad wildcards
  • keep timeouts short for external requests
  • parse to structured output (JSON) rather than raw HTML dumps
  • separate fetch and parse stages to isolate failures
  • pre-bake stable dependencies to avoid per-run install overhead
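The first pattern above deserves care: hand-writing regexes for a whitelist makes it easy to forget that an unescaped "." matches any character. A small helper (hypothetical, not part of the library) can build exact-match patterns from plain hostnames:

```typescript
// Hypothetical helper: turn a plain hostname into an exact-match whitelist
// entry, escaping regex metacharacters so "." only matches a literal dot.
function exactHostPattern(host: string): string {
  const escaped = host.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return `^${escaped}$`;
}

const whitelist = ["api.github.com", "en.wikipedia.org"].map(exactHostPattern);
console.log(whitelist); // patterns suitable for networkFilter.whitelist
```

With escaping, a lookalike host such as "apiXgithub.com" no longer slips through a pattern meant for "api.github.com".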