
Conversation

janbuchar (Contributor) commented on Aug 7, 2025

Plan

In my opinion, it makes a lot of sense to do the remaining changes in a separate PR.

  • Introduce a ContextPipeline abstraction
  • Update crawlers to use it
  • Make sure that existing tests pass
  • Refine the ContextPipeline.compose signature and the semantics of BasicCrawlerOptions.contextPipelineEnhancer to maximize DX
  • Write tests for the contextPipelineEnhancer
  • Resolve added TODO comments (fix immediately or make issues)
  • Update documentation

Intent

The context-pipeline branch introduces a fundamental architectural change to how Crawlee crawlers build and enhance the crawling context passed to request handlers. The core motivation is to fix the composition and extensibility nightmare in the current crawler hierarchy.

The Problem

  1. Rigid inheritance hierarchy: Crawlers were stuck in a brittle inheritance chain where each layer manipulated the context object while assuming that it already satisfied its final type. Multiple overrides of BasicCrawler lifecycle methods made the execution flow even harder to follow.

  2. Context enhancement via monkey-patching: Manual property assignments (crawlingContext.page = page, crawlingContext.$ = $) were scattered everywhere, making the flow a mess to follow and nearly impossible to reason about.

  3. Cleanup coordination: Resource cleanup was handled by separate _cleanupContext methods that were not co-located with the initialization.

  4. Extension mechanism was broken: The CrawlerExtension.use() API tried to let you extend crawlers (the ones based on HttpCrawler) by overwriting properties - completely type-unsafe and fragile as hell.

The Solution

Introduces ContextPipeline - a middleware-based composition pattern where:

  • Each crawler layer defines how it enhances the context through explicit action functions
  • Cleanup logic is co-located with initialization via optional cleanup functions
  • Type safety is maintained through TypeScript generics that track context transformations
  • The pipeline executes middleware sequentially with proper error handling and guaranteed cleanup (see the sketch below)
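
A minimal sketch of that execution flow (the names and signatures here are simplified assumptions, not the PR's actual code):

interface Middleware<TIn, TOut> {
  action: (context: TIn) => Promise<TOut>;
  cleanup?: (context: TIn & TOut, error?: unknown) => Promise<void>;
}

// Runs the middleware chain in order, hands the final context to the consumer,
// and unwinds cleanups in reverse order whether the consumer succeeds or throws.
async function runPipeline(
  middlewares: Middleware<any, any>[],
  base: object,
  consumer: (context: any) => Promise<void>,
): Promise<void> {
  const cleanups: Array<(error?: unknown) => Promise<void>> = [];
  let context: any = base;

  try {
    for (const { action, cleanup } of middlewares) {
      const extension = await action(context);
      context = { ...context, ...extension };
      if (cleanup) {
        const snapshot = context;
        cleanups.push((error) => cleanup(snapshot, error));
      }
    }
    await consumer(context);
  } catch (error) {
    // Guaranteed cleanup on failure, innermost middleware first.
    for (const fn of cleanups.reverse()) await fn(error);
    throw error;
  }
  for (const fn of cleanups.reverse()) await fn();
}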

Key Design Decisions

1. Middleware Pattern

Declarative middleware composition with co-located cleanup:

contextPipeline.compose({
  action: async (context) => ({ page, $ }),
  cleanup: async (context) => { await context.page.close(); }
})

2. Type-Safe Context Building

The ContextPipeline<TBase, TFinal> tracks type transformations through the chain:

ContextPipeline<CrawlingContext, CrawlingContext>
  .compose<{ page: Page }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page }>
  .compose<{ $: CheerioAPI }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page, $: CheerioAPI }>

3. New Extension Mechanism

CrawlerExtension.use() is gone; the new approach goes through contextPipelineEnhancer:

new BasicCrawler({
  contextPipelineEnhancer: (pipeline) => 
    pipeline.compose({
      action: async (context) => ({ myCustomProp: ... })
    })
})

Discussion Topics

1. The API

The current way to express a context pipeline middleware has some shortcomings (ContextPipeline.compose, BasicCrawlerOptions.contextPipelineEnhancer). I suggest resolving this in another PR.

2. Migration Path

For most legitimate use cases, this should be non-breaking. Those who extend the Crawler classes in non-trivial ways may need to adjust their code though - the non-public interface of BasicCrawler and HttpCrawler changed quite a bit.

3. Performance

The pipeline uses Object.defineProperties for each middleware. Is this a serious performance consideration?
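
For context, the per-middleware step presumably looks something like this (an assumption about the implementation, not a quote from it):

// Copy the extension's property descriptors (including getters) onto the context.
Object.defineProperties(crawlingContext, Object.getOwnPropertyDescriptors(extension));

A plain Object.assign would be cheaper, but it eagerly invokes getters while copying, which is likely why defineProperties was chosen.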

janbuchar added the t-tooling label (issues with this label are in the ownership of the tooling team) on Aug 7, 2025
/** The main middleware function that enhances the context */
action: (context: TCrawlingContext) => Promise<TCrawlingContextExtension>;
/** Optional cleanup function called after the consumer finishes or fails */
cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Promise<void>;
janbuchar (Contributor Author):

Returning a cleanup callback from action may be a better approach. A benefit of that would be having access to the outer scope.
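
A sketch of that alternative shape (a hypothetical API, with a made-up browser handle for illustration):

contextPipeline.compose({
  action: async (context) => {
    // `browser` is assumed to be in scope here.
    const page = await browser.newPage();
    return {
      extension: { page },
      // The cleanup closes over `page` from the action's scope instead of
      // reading it back off the context.
      cleanup: async (error?: unknown) => { await page.close(); },
    };
  },
})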

janbuchar marked this pull request as ready for review on October 16, 2025
import { Router } from '../index.js';
import { parseContentTypeFromResponse } from './utils.js';

export type FileDownloadOptions<
janbuchar (Contributor Author):

I made quite substantial changes to the options here so that we don't need to accept requestHandler xor streamHandler. My motivation was that I thought it'd be difficult to shoehorn this into ContextPipeline, but it actually may be pretty easy.

So... is this worth it or do we want the original interface?

Mainly for @barjin's eyes

Contributor:

I think that with the changes from #3163, the streamHandler / requestHandler might not be necessary anyway.

I'm not too happy about the current design anyway, so feel free to change the interface as you see fit 👍

barjin (Contributor) left a comment:

Let's merge this quick so we unblock all the other PRs 😄

A few ideas to get the discussion started:


export class ContextPipelineInitializationError extends Error {
constructor(
public error: unknown,
Contributor:

Perhaps you'd want to use the Error cause field (see MDN)? It seems more idiomatic that way.

janbuchar (Contributor Author):

Great idea, thanks!
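
For reference, the idiomatic shape with the standard cause option (ES2022) would be along these lines:

export class ContextPipelineInitializationError extends Error {
  constructor(cause: unknown) {
    // `cause` is attached via the standard Error options bag and is
    // surfaced by Node's error inspection.
    super('Context pipeline initialization failed', { cause });
  }
}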

export class ContextPipelineInitializationError extends Error {
constructor(
public error: unknown,
public crawlingContext: {},
Contributor:

Logging the error instance will also log the entire attached crawling context into the console. This can be quite heavy - see all the stuff we're attaching to the context e.g. in the browser crawlers.

I see the context being useful for debugging as well, so this is rather a question - are you aware / sure about this?

janbuchar (Contributor Author):

I'm aware of that, it actually bit me earlier 😁 After some consideration, we probably don't need the context here. The Python version does, but here we can make do with a simple cast.

*/
let numberOfRotations = -1;
const failedRequestHandler = vitest.fn();
const got = new GotScrapingHttpClient();
Contributor:

afaiac we want to sunset got-scraping in favour of impit. Can you please use ImpitHttpClient here?
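
For illustration, the swap would be roughly this, assuming the client is exported from the @crawlee/impit-client package:

import { ImpitHttpClient } from '@crawlee/impit-client';

const httpClient = new ImpitHttpClient();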

) {
super({
...options,
contextPipelineBuilder: () => this.buildContextPipeline(),
Contributor:

Is this a necessary step when subclassing Crawler instances now? Shouldn't this be the default behaviour anyway?

janbuchar (Contributor Author):

Only BrowserCrawler, and that one is special in many ways...

...options,
contextPipelineBuilder: () =>
this.buildContextPipeline().compose({
action: async (context) => this.parseContent(context),
Contributor:

nit: maybe making two methods (parseContent, addHelpers) would help with dogfooding the ContextPipeline design? The current parseContent method does more than just parse content.

janbuchar (Contributor Author):

Just to be sure, you mean that we should split the parseContent middleware into two separate steps, right? Sounds good to me...

Contributor:

Yes, that was the idea 👍
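
Sketched out, the split might read like this (addHelpers being the hypothetical second step named above):

contextPipelineBuilder: () =>
  this.buildContextPipeline()
    .compose({ action: async (context) => this.parseContent(context) })
    .compose({ action: async (context) => this.addHelpers(context) }),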

};
}
async waitForSelector(selector: string, timeoutMs = 5_000) {
const $ = cheerio.load(this.body);
Contributor:

Are we sure that this in the context extensions will always be bound to the context? I'd rather reference the crawlingContext passed as the param to the outer function, just to be sure.
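
That is, roughly (a sketch based on the diff above, with the surrounding middleware assumed and timeout handling elided):

action: async (crawlingContext) => ({
  async waitForSelector(selector: string, timeoutMs = 5_000) {
    // Reference the captured crawlingContext instead of `this`, so the helper
    // keeps working even when destructured off the context object.
    const $ = cheerio.load(crawlingContext.body);
    return $(selector);
  },
}),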

/**
* Get a key-value store with given name or id, or the default one for the crawler.
*/
getKeyValueStore: (idOrName?: string) => Promise<KeyValueStore>;
Contributor:

The RestrictedContext's getKeyValueStore returns a "limited" KVS type - some methods (e.g. recordExists or getAutosavedValue) are missing from there. Those are imo safe to call from a regular (non-adaptive) crawler.

Was this intentional?

Comment on lines +174 to +177
contextPipelineBuilder: () =>
this.buildContextPipeline().compose({
action: async (context) => await this.parseContent(context),
}),
Contributor:

idea: what about replacing this with

override async buildContextPipeline() {
  return super.buildContextPipeline().compose({
    ...
  });
}

It feels slightly more OOP.

Also, then there might be no need for the contextPipelineBuilder constructor option and we'd only have the ...Enhancer method? I feel like having two such methods might be too much for the users (plus, imo a user will only want to extend the fully initialized context anyway).

janbuchar (Contributor Author):

I will extend this later with more detail, but the issue with a simple abstract method lies in correct generic types - when implementing CheerioCrawler for instance, you want to extend HttpCrawler<CheerioCrawlingContext>, but the inheritance "language" cannot express overriding HttpCrawler.buildContextPipeline (that returns a ContextPipeline<CrawlingContext, HttpCrawlingContext>) with CheerioCrawler.buildContextPipeline (that returns ContextPipeline<CrawlingContext, CheerioCrawlingContext>).

Oof. I hope that makes at least some sense. In any case, contextPipelineBuilder is not meant for your regular user; you shouldn't need it unless you're implementing a new crawler type. We should probably make that clearer; not sure why it isn't already.
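
A condensed sketch of the typing problem (all types and signatures simplified and assumed):

interface CrawlingContext { request: unknown }
interface HttpCrawlingContext extends CrawlingContext { body: string }
interface CheerioCrawlingContext extends HttpCrawlingContext { $: unknown }

declare class ContextPipeline<TBase, TFinal> {
  compose<TExt>(middleware: {
    action: (context: TFinal) => Promise<TExt>;
  }): ContextPipeline<TBase, TFinal & TExt>;
  run(base: TBase): Promise<TFinal>;
}

declare function buildHttpPipeline(): ContextPipeline<CrawlingContext, HttpCrawlingContext>;

abstract class BasicCrawler<Context extends CrawlingContext> {
  // Request handlers are typed against Context, so the pipeline must end in it.
  protected abstract buildContextPipeline(): ContextPipeline<CrawlingContext, Context>;
}

class HttpCrawler<Context extends HttpCrawlingContext = HttpCrawlingContext> extends BasicCrawler<Context> {
  protected buildContextPipeline() {
    // Type error: this pipeline ends in HttpCrawlingContext, but the inherited
    // signature demands one ending in the possibly narrower Context
    // (e.g. CheerioCrawlingContext when CheerioCrawler extends
    // HttpCrawler<CheerioCrawlingContext>).
    return buildHttpPipeline();
  }
}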

