# Retrieval connectors

A **connector** is the only place the framework touches a specific data
source. The Baseline pillar agents don't know SEC EDGAR from Bloomberg
from a CSV vault — they hand a `JobRequest.sources` string to the
dispatcher, and the dispatcher looks up the connector registered under
that name.

> **Scope rule.** The framework core ships with a single connector, `mock`.
> SEC, Bloomberg, internal databases, and any other domain source live in
> *this* directory as additions, never as core changes.

## Available connectors

All public-data connectors below use the User-Agent
`MR mitchell.roy@sia-partners.com` against their respective sources
(SEC requires it; the rest accept it as a courtesy). Each declares a
conservative rate-limit envelope that the dispatcher's `RateLimiter`
enforces on its behalf.

| Name                  | File                       | What it returns                                                                                                      | Auth      |
|-----------------------|----------------------------|----------------------------------------------------------------------------------------------------------------------|-----------|
| `sec-edgar`           | `sec-edgar.ts`             | Composite SEC retrieval — ticker / name → CIK then XBRL company facts. Includes the four "sec_*" standalone fetchers. | none      |
| `sec-financials`      | `sec-financials.ts`        | XBRL company facts JSON for one CIK, optionally narrowed to a comma-separated concept list.                          | none      |
| `sec-submissions`     | `sec-submissions.ts`       | Filing history (most recent ≤50) for one CIK, optionally narrowed to specific filing types.                          | none      |
| `sec-filing-document` | `sec-filing-document.ts`   | Single filing document — pass-through of a `https://www.sec.gov/Archives/...` URL, HTML stripped, truncated.         | none      |
| `finra-brokercheck`   | `finra-brokercheck.ts`     | Broker-dealer firm/individual registration and disciplinary history from FINRA's public BrokerCheck.                 | none      |
| `iapd`                | `iapd.ts`                  | Investment Adviser Public Disclosure — Form ADV firm/individual lookups via the IAPD search gateway.                 | none      |
| `fred`                | `fred.ts`                  | Federal Reserve Economic Data — time-series observations for any FRED series id.                                     | `FRED_API_KEY` (free) |
| `irs-bmf`             | `irs-bmf.ts`               | IRS Exempt Organizations Business Master File — per-state CSV download parsed to JSON rows.                          | none      |

Additional SEC-specific surfaces live under `sec-edgar-*.ts` —
filings (`sec-edgar-filings.ts`), XBRL extras (`sec-edgar-xbrl.ts`),
and insider ownership forms (`sec-edgar-insider.ts`).
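
To make the rate-limit envelope mentioned above the table concrete, here is a minimal token-bucket sketch of how `requestsPerSecond` and `burstSize` could be interpreted. It is an assumption about the semantics, not the dispatcher's actual `RateLimiter`; `TokenBucketSketch`, `RateLimitEnvelope`, and `tryAcquire` are hypothetical names, and timestamps are passed in explicitly so the behaviour is deterministic.

```ts
// Hypothetical limiter: NOT the dispatcher's RateLimiter, just the same idea.
interface RateLimitEnvelope {
  requestsPerSecond: number;
  burstSize?: number;
}

class TokenBucketSketch {
  private readonly refillPerSec: number;
  private readonly capacity: number;
  private tokens: number;
  private lastMs: number;

  constructor(limit: RateLimitEnvelope, nowMs: number) {
    this.refillPerSec = limit.requestsPerSecond;
    // Assumption: without burstSize the bucket holds a single request.
    this.capacity = limit.burstSize ?? 1;
    this.tokens = this.capacity; // start full so an initial burst is allowed
    this.lastMs = nowMs;
  }

  // Returns true if a request may go out at `nowMs`, consuming one token.
  tryAcquire(nowMs: number): boolean {
    const elapsedSec = Math.max(0, nowMs - this.lastMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastMs = Math.max(this.lastMs, nowMs);
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// An envelope of 2 requests per second with a burst of 4.
const bucket = new TokenBucketSketch({ requestsPerSecond: 2, burstSize: 4 }, 0);
const burst = [0, 0, 0, 0, 0].map((t) => bucket.tryAcquire(t)); // four pass, the fifth is refused
const later = bucket.tryAcquire(500); // half a second later, one token has refilled
```

Read this way, `burstSize` caps how many requests can go out back to back, while `requestsPerSecond` sets the sustained rate once the burst is spent.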

## Contract

Every connector satisfies the `RetrievalConnector` interface in
[`../interface.ts`](../interface.ts):

```ts
export interface RetrievalConnector {
  readonly name: string;                                     // unique registration key
  readonly authRequired: boolean;                            // credentials needed?
  readonly rateLimit: { requestsPerSecond: number; burstSize?: number };

  fetch(params: FetchParams): Promise<RawPayload>;           // the actual call
  isAvailable(): Promise<boolean>;                           // liveness probe
}
```

`FetchParams` carries `{ entity: { id, aliases }, period, filingTypes?, query? }`.
`RawPayload` carries `{ source, sourceUrl?, capturedAt, contentType, rawContent, metadata }`.
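
As a rough sketch of those two shapes: the field names come from the lines above, but every field type below (string ids, a string `period`, text `rawContent`, `Record<string, unknown>` metadata) is an assumption, and `../interface.ts` remains the authority. The `Sketch` suffix and the `stampPayload` helper are hypothetical.

```ts
// Hypothetical shapes: field names follow this README, field types are guesses.
interface FetchParamsSketch {
  entity: { id: string; aliases: string[] };
  period: string;            // e.g. "FY-2024" (format assumed)
  filingTypes?: string[];    // e.g. ["10-K", "10-Q"]
  query?: string;
}

interface RawPayloadSketch {
  source: string;            // connector name
  sourceUrl?: string;        // canonical upstream URL
  capturedAt: string;        // ISO-8601 timestamp taken at fetch time
  contentType: string;
  rawContent: string;        // assumed to be text here
  metadata: Record<string, unknown>;
}

// Build a payload with provenance stamped at the moment of capture.
function stampPayload(
  source: string,
  sourceUrl: string,
  contentType: string,
  rawContent: string,
  params: FetchParamsSketch,
  now: Date = new Date(),
): RawPayloadSketch {
  return {
    source,
    sourceUrl,
    capturedAt: now.toISOString(),
    contentType,
    rawContent,
    metadata: { entity: params.entity.id, period: params.period },
  };
}
```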

## Required behaviours

1. **Stamp provenance.** Set `source` to your connector name, `sourceUrl`
   to the canonical upstream URL, and `capturedAt` to an ISO-8601
   timestamp captured *at fetch time*. The Source/Extraction agent
   relies on these to satisfy Standard 4.
2. **Respect declared rate limits.** Declare realistic numbers in
   `rateLimit`; the dispatcher's `RateLimiter` enforces them on your
   behalf. Don't add your own ad-hoc throttling.
3. **Fail with `RetrievalError`.** Surface structured failures with one
   of the documented categories (`unavailable`, `no-content`,
   `auth-failed`, `rate-limited`, `invalid-request`, `internal`). The
   dispatcher converts these into a `DispatchResult` for the caller —
   never throw a raw `Error` (Std 12).
4. **No domain leakage into the framework.** Anything specific to the
   upstream source — endpoint URLs, auth flows, payload shapes, label
   conventions — stays inside the connector module.
5. **Use the shared `httpGet` / `httpRequest`** from
   [`../http-client.ts`](../http-client.ts) when you need to make
   outbound HTTP calls; that's where the framework controls
   User-Agent, timeouts, and body-size caps.
6. **Narrow-first responses (Std 5 protection).** Any tool that can
   return a large response body (XBRL fact sets, full filing texts,
   full filing histories, broad search results) MUST default to a
   *summary* describing the available data elements — names,
   counts, structural hints — when the caller does not specify which
   elements they want. Full data is returned only when the caller
   names specific elements (concept names, filing accession numbers,
   document URLs, etc.). This protects downstream LLMs from being
   force-fed hundreds of irrelevant rows just because the agent
   called the tool without arguments; it also makes "discover, then
   fetch targeted" a one-tool pattern instead of two. See
   `sec_financials` in `sec-edgar.ts` for the canonical example.
7. **Accept scope parameters (Std 5 protection, applied to filterable
   axes).** Retrieval tools whose responses span multiple periods,
   entities, units, or other natural axes SHOULD accept optional
   scope arguments (`period`, `entity`, `unit`, `concept`, etc.) and
   filter their response server-side to just the matching rows
   before returning. The agent passes whatever scope it already knows
   from the JobRequest; the connector does the narrowing. This is
   the same principle as rule #6 applied to the *arguments* of a
   tool rather than the choice between tools — instead of returning
   a 232-row time series so the agent can find the one FY-2024 row,
   accept `period: "FY-2024"` and return one row. Scope arguments are
   optional: omitting them returns the full slice on that axis,
   preserving discovery use cases. See `sec_company_concept` in
   `sec-edgar-xbrl.ts` for the canonical example (`period` + `unit`).
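
Rules 6 and 7 can be sketched together. The example below is illustrative only and is not the code in `sec-edgar.ts` or `sec-edgar-xbrl.ts`; `FactRow`, `FactQuery`, `FactResponse`, and `respond` are hypothetical names, and applying the scope filters before building the summary is a design choice made here, not something the rules prescribe.

```ts
// Hypothetical row shape for an XBRL-style fact set.
interface FactRow {
  concept: string;   // e.g. "Revenues"
  period: string;    // e.g. "FY-2024"
  unit: string;      // e.g. "USD"
  value: number;
}

interface FactQuery {
  concepts?: string[];  // rule 6: no concepts named means summary only
  period?: string;      // rule 7: optional scope filters
  unit?: string;
}

type FactResponse =
  | { kind: 'summary'; totalRows: number; concepts: { name: string; rows: number }[] }
  | { kind: 'rows'; rows: FactRow[] };

function respond(all: FactRow[], q: FactQuery = {}): FactResponse {
  // Rule 7: narrow by whichever scope arguments the caller supplied.
  const scoped = all.filter(
    (r) =>
      (q.period === undefined || r.period === q.period) &&
      (q.unit === undefined || r.unit === q.unit),
  );

  // Rule 6: without named concepts, describe what exists instead of returning it.
  if (!q.concepts || q.concepts.length === 0) {
    const counts = new Map<string, number>();
    for (const r of scoped) counts.set(r.concept, (counts.get(r.concept) ?? 0) + 1);
    return {
      kind: 'summary',
      totalRows: scoped.length,
      concepts: Array.from(counts, ([name, rows]) => ({ name, rows })),
    };
  }

  // Concepts were named: return only the matching rows.
  const wanted = new Set(q.concepts);
  return { kind: 'rows', rows: scoped.filter((r) => wanted.has(r.concept)) };
}
```

Calling `respond(rows)` with no arguments yields concept names and row counts; calling it with `{ concepts: ['Revenues'], period: 'FY-2024' }` yields only the rows the agent asked for.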

## Adding a new connector

1. Create `src/tools/retrieval/connectors/<your-source>.ts` (skeleton
   below). Keep it self-contained — auth, URL building, response
   shaping all live in this one file.
2. Register it from your application bootstrap (or wherever you wire
   the framework) by calling:
   ```ts
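   // Paths are relative to this directory; adjust them for your bootstrap module.
   import { registerConnector } from '../dispatcher.js';
   import { YourSourceConnector } from './your-source.js';
   registerConnector(new YourSourceConnector(config)); // config: YourSourceConfig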
   ```
3. Reference it from a `JobRequest` by name:
   ```jsonc
   { "sources": ["your-source"] }
   ```

The dispatcher routes the request to your connector. If a `JobRequest`
names a source that has no registered connector, the dispatcher fails with
`category: 'connector-not-registered'` and lists the known sources.
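
A minimal sketch of that lookup, assuming the registry is a map keyed by connector name and that the failure comes back as a result object rather than a thrown error. This is not the code in `../dispatcher.ts`; `NamedConnector`, `ResolveResult`, `registerConnectorSketch`, and `resolveSource` are hypothetical names.

```ts
// Stand-in for RetrievalConnector; only `name` matters for routing.
interface NamedConnector {
  readonly name: string;
}

type ResolveResult =
  | { ok: true; connector: NamedConnector }
  | { ok: false; category: 'connector-not-registered'; message: string; knownSources: string[] };

const connectorRegistry = new Map<string, NamedConnector>();

function registerConnectorSketch(connector: NamedConnector): void {
  // Assumption: a later registration under the same name replaces the earlier one.
  connectorRegistry.set(connector.name, connector);
}

function resolveSource(sourceName: string): ResolveResult {
  const connector = connectorRegistry.get(sourceName);
  if (connector) return { ok: true, connector };
  // Unknown source: report the failure category and list what is registered.
  const knownSources = Array.from(connectorRegistry.keys()).sort();
  return {
    ok: false,
    category: 'connector-not-registered',
    message: `No connector registered for "${sourceName}". Known sources: ${knownSources.join(', ')}`,
    knownSources,
  };
}

registerConnectorSketch({ name: 'mock' });
registerConnectorSketch({ name: 'fred' });
const hit = resolveSource('fred');        // ok: true
const miss = resolveSource('bloomberg');  // ok: false, knownSources: ['fred', 'mock']
```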

## Skeleton

```ts
// src/tools/retrieval/connectors/your-source.ts
import {
  RetrievalError,
  type FetchParams,
  type RawPayload,
  type RetrievalConnector,
} from '../interface.js';
import { httpGet } from '../http-client.js';

export interface YourSourceConfig {
  readonly apiKey?: string;
  readonly baseUrl: string;
}

export class YourSourceConnector implements RetrievalConnector {
  readonly name = 'your-source';
  readonly authRequired = true;
  readonly rateLimit = { requestsPerSecond: 2, burstSize: 4 };

  constructor(private readonly cfg: YourSourceConfig) {}

  async isAvailable(): Promise<boolean> {
    if (this.authRequired && !this.cfg.apiKey) return false;
    return true;
  }

  async fetch(params: FetchParams): Promise<RawPayload> {
    if (!this.cfg.apiKey) {
      throw new RetrievalError('auth-failed', 'YOUR_SOURCE_API_KEY is not configured');
    }
    const url = this.buildUrl(params);
    try {
      const res = await httpGet(url, {
        headers: { Authorization: `Bearer ${this.cfg.apiKey}` },
      });
      return {
        source: this.name,
        sourceUrl: res.url,
        capturedAt: new Date().toISOString(),
        contentType: res.contentType,
        rawContent: res.body,
        metadata: { status: res.status, entity: params.entity.id, period: params.period },
      };
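    } catch (err) {
      // Pass through failures that are already structured.
      if (err instanceof RetrievalError) throw err;
      // Translate HTTP errors into RetrievalError categories. Matching on the
      // message is a placeholder; prefer a status code if the client exposes one.
      const msg = err instanceof Error ? err.message : String(err);
      if (/timeout/i.test(msg))   throw new RetrievalError('unavailable', msg);
      if (/\b429\b/.test(msg))    throw new RetrievalError('rate-limited', msg);
      if (/\b40[13]\b/.test(msg)) throw new RetrievalError('auth-failed', msg);
      if (/\b404\b/.test(msg))    throw new RetrievalError('no-content', msg);
      if (/\b400\b/.test(msg))    throw new RetrievalError('invalid-request', msg);
      throw new RetrievalError('internal', msg);
    }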
  }

  private buildUrl(params: FetchParams): string {
    const e = encodeURIComponent(params.entity.id);
    const p = encodeURIComponent(params.period);
    return `${this.cfg.baseUrl}/filings?entity=${e}&period=${p}`;
  }
}
```
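
The skeleton's `catch` block classifies failures by matching on the error message. If the HTTP client exposes a numeric status, a lookup like the sketch below is less brittle. The category names come from the Required behaviours list; the status-to-category mapping itself, and the `categoryForStatus` and `RetrievalCategory` names, are assumptions rather than framework code.

```ts
// The six categories documented in this README.
type RetrievalCategory =
  | 'unavailable'
  | 'no-content'
  | 'auth-failed'
  | 'rate-limited'
  | 'invalid-request'
  | 'internal';

// Assumed mapping from an HTTP status code to a failure category.
function categoryForStatus(status: number): RetrievalCategory {
  if (status === 401 || status === 403) return 'auth-failed';
  if (status === 404) return 'no-content';
  if (status === 429) return 'rate-limited';
  if (status >= 400 && status < 500) return 'invalid-request'; // other client errors
  if (status >= 500 && status < 600) return 'unavailable';     // upstream is failing
  return 'internal'; // anything else reaching the error path is unexpected
}
```

Specific codes are checked before the generic 4xx range, so 429 never falls through to `invalid-request`.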

## What connectors must **not** do

- They must not normalize, deduplicate, infer, or interpret values —
  that's the Normalization and Resolution agents' job.
- They must not persist anything; the orchestrator does write-back.
- They must not call the LLM tool; that belongs to the parser/normalizer
  layers. Connectors retrieve raw bytes only.