
Conversation

janbuchar (Contributor) commented on Aug 7, 2025

Plan

In my opinion, it makes a lot of sense to do the remaining changes in a separate PR.

  • Introduce a ContextPipeline abstraction
  • Update crawlers to use it
  • Make sure that existing tests pass
  • Refine the ContextPipeline.compose signature and the semantics of BasicCrawlerOptions.contextPipelineEnhancer to maximize DX
  • Write tests for the contextPipelineEnhancer
  • Resolve added TODO comments (fix immediately or make issues)
  • Update documentation

Intent

The context-pipeline branch introduces a fundamental architectural change to how Crawlee crawlers build and enhance the crawling context passed to request handlers. The core motivation is to fix the composition and extensibility nightmare in the current crawler hierarchy.

The Problem

  1. Rigid inheritance hierarchy: Crawlers were stuck in a brittle inheritance chain where each layer manipulated the context object while assuming that it already satisfied its final type. Multiple overrides of BasicCrawler lifecycle methods made the execution flow even harder to follow.

  2. Context enhancement via monkey-patching: Manual property assignments (crawlingContext.page = page, crawlingContext.$ = $) were scattered everywhere, making the flow a mess to follow and nearly impossible to reason about.

  3. Cleanup coordination: Resource cleanup was handled by separate _cleanupContext methods that were not co-located with the initialization.

  4. Extension mechanism was broken: The CrawlerExtension.use() API tried to let you extend crawlers (the ones based on HttpCrawler) by overwriting properties - completely type-unsafe and fragile as hell.

The Solution

Introduces ContextPipeline - a middleware-based composition pattern where:

  • Each crawler layer defines how it enhances the context through explicit action functions
  • Cleanup logic is co-located with initialization via optional cleanup functions
  • Type safety is maintained through TypeScript generics that track context transformations
  • The pipeline executes middleware sequentially with proper error handling and guaranteed cleanup (see the sketch below)
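
A minimal sketch of that execution flow (the names and signatures here are simplified assumptions, not the PR's actual code):

interface Middleware<TIn, TOut> {
  action: (context: TIn) => Promise<TOut>;
  cleanup?: (context: TIn & TOut, error?: unknown) => Promise<void>;
}

// Runs the middleware chain in order, hands the final context to the consumer,
// and unwinds cleanups in reverse order whether the consumer succeeds or throws.
async function runPipeline(
  middlewares: Middleware<any, any>[],
  base: object,
  consumer: (context: any) => Promise<void>,
): Promise<void> {
  const cleanups: Array<(error?: unknown) => Promise<void>> = [];
  let context: any = base;

  try {
    for (const { action, cleanup } of middlewares) {
      const extension = await action(context);
      context = { ...context, ...extension };
      if (cleanup) {
        const snapshot = context;
        cleanups.push((error) => cleanup(snapshot, error));
      }
    }
    await consumer(context);
  } catch (error) {
    // Guaranteed cleanup on failure, innermost middleware first.
    for (const fn of cleanups.reverse()) await fn(error);
    throw error;
  }
  for (const fn of cleanups.reverse()) await fn();
}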

Key Design Decisions

1. Middleware Pattern

Declarative middleware composition with co-located cleanup:

contextPipeline.compose({
  action: async (context) => ({ page, $ }),
  cleanup: async (context) => { await context.page.close(); }
})

2. Type-Safe Context Building

The ContextPipeline<TBase, TFinal> tracks type transformations through the chain:

ContextPipeline<CrawlingContext, CrawlingContext>
  .compose<{ page: Page }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page }>
  .compose<{ $: CheerioAPI }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page, $: CheerioAPI }>

3. New Extension Mechanism

CrawlerExtension.use() is gone; the new approach goes through contextPipelineEnhancer:

new BasicCrawler({
  contextPipelineEnhancer: (pipeline) => 
    pipeline.compose({
      action: async (context) => ({ myCustomProp: ... })
    })
})

Discussion Topics

1. The API

The current way to express a context pipeline middleware has some shortcomings (ContextPipeline.compose, BasicCrawlerOptions.contextPipelineEnhancer). I suggest resolving this in another PR.

2. Migration Path

For most legitimate use cases, this should be non-breaking. Those who extend the Crawler classes in non-trivial ways may need to adjust their code though - the non-public interface of BasicCrawler and HttpCrawler changed quite a bit.

3. Performance

The pipeline uses Object.defineProperties for each middleware. Is this a serious performance consideration?
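
For context, the per-middleware step presumably looks something like this (an assumption about the implementation, not a quote from it):

// Copy the extension's property descriptors (including getters) onto the context.
Object.defineProperties(crawlingContext, Object.getOwnPropertyDescriptors(extension));

A plain Object.assign would be cheaper, but it eagerly invokes getters while copying, which is likely why defineProperties was chosen.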

janbuchar added the t-tooling label (issues with this label are in the ownership of the tooling team) on Aug 7, 2025
/** The main middleware function that enhances the context */
action: (context: TCrawlingContext) => Promise<TCrawlingContextExtension>;
/** Optional cleanup function called after the consumer finishes or fails */
cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Promise<void>;
janbuchar (Contributor Author):

Returning a cleanup callback from action may be a better approach. A benefit of that would be having access to the outer scope.
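
A sketch of that alternative shape (a hypothetical API, with a made-up browser handle for illustration):

contextPipeline.compose({
  action: async (context) => {
    // `browser` is assumed to be in scope here.
    const page = await browser.newPage();
    return {
      extension: { page },
      // The cleanup closes over `page` from the action's scope instead of
      // reading it back off the context.
      cleanup: async (error?: unknown) => { await page.close(); },
    };
  },
})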

janbuchar marked this pull request as ready for review on October 16, 2025
import { Router } from '../index.js';
import { parseContentTypeFromResponse } from './utils.js';

export type FileDownloadOptions<
janbuchar (Contributor Author):

I made quite substantial changes to the options here so that we don't need to accept requestHandler xor streamHandler. My motivation was that I thought it'd be difficult to shoehorn this into ContextPipeline, but it actually may be pretty easy.

So... is this worth it or do we want the original interface?

Mainly for @barjin's eyes

Contributor:

I think that with the changes from #3163, the streamHandler / requestHandler might not be necessary anyway.

I'm not too happy about the current design anyway, so feel free to change the interface as you see fit 👍

barjin (Contributor) left a comment:

Let's merge this quick so we unblock all the other PRs 😄

A few ideas to get the discussion started:


export class ContextPipelineInitializationError extends Error {
constructor(
public error: unknown,
Contributor:

Perhaps you'd want to use the Error cause field (see MDN)? It seems more idiomatic that way.

janbuchar (Contributor Author):

Great idea, thanks!
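
For reference, the idiomatic shape with the standard cause option (ES2022) would be along these lines:

export class ContextPipelineInitializationError extends Error {
  constructor(cause: unknown) {
    // `cause` is attached via the standard Error options bag and is
    // surfaced by Node's error inspection.
    super('Context pipeline initialization failed', { cause });
  }
}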

export class ContextPipelineInitializationError extends Error {
constructor(
public error: unknown,
public crawlingContext: {},
Contributor:

Logging the error instance will also log the entire attached crawling context into the console. This can be quite heavy - see all the stuff we're attaching to the context e.g. in the browser crawlers.

I see the context being useful for debugging as well, so this is rather a question - are you aware / sure about this?

janbuchar (Contributor Author):

I'm aware of that, it actually bit me earlier 😁 After some consideration, we probably don't need the context here. The Python version does, but here we can make do with a simple cast.

*/
let numberOfRotations = -1;
const failedRequestHandler = vitest.fn();
const got = new GotScrapingHttpClient();
Contributor:

afaiac we want to sunset got-scraping in favour of impit. Can you please use ImpitHttpClient here?
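
For illustration, the swap would be roughly this, assuming the client is exported from the @crawlee/impit-client package:

import { ImpitHttpClient } from '@crawlee/impit-client';

const httpClient = new ImpitHttpClient();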

) {
super({
...options,
contextPipelineBuilder: () => this.buildContextPipeline(),
Contributor:

Is this a necessary step when subclassing Crawler instances now? Shouldn't this be the default behaviour anyway?

janbuchar (Contributor Author):

Only BrowserCrawler, and that one is special in many ways...

...options,
contextPipelineBuilder: () =>
this.buildContextPipeline().compose({
action: async (context) => this.parseContent(context),
Contributor:

nit: maybe making two methods (parseContent, addHelpers) would help with dogfooding the ContextPipeline design? The current parseContent method does more than just parse content.

janbuchar (Contributor Author):

Just to be sure, you mean that we should split the parseContent middleware into two separate steps, right? Sounds good to me...

Contributor:

Yes, that was the idea 👍
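
Sketched out, the split might read like this (addHelpers being the hypothetical second step named above):

contextPipelineBuilder: () =>
  this.buildContextPipeline()
    .compose({ action: async (context) => this.parseContent(context) })
    .compose({ action: async (context) => this.addHelpers(context) }),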

};
}
async waitForSelector(selector: string, timeoutMs = 5_000) {
const $ = cheerio.load(this.body);
Contributor:

Are we sure that this in the context extensions will always be bound to the context? I'd rather reference the crawlingContext passed as the param to the outer function, just to be sure.
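
That is, roughly (a sketch based on the diff above, with the surrounding middleware assumed and timeout handling elided):

action: async (crawlingContext) => ({
  async waitForSelector(selector: string, timeoutMs = 5_000) {
    // Reference the captured crawlingContext instead of `this`, so the helper
    // keeps working even when destructured off the context object.
    const $ = cheerio.load(crawlingContext.body);
    return $(selector);
  },
}),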

/**
* Get a key-value store with given name or id, or the default one for the crawler.
*/
getKeyValueStore: (idOrName?: string) => Promise<KeyValueStore>;
Contributor:

The RestrictedContext's getKeyValueStore returns a "limited" KVS type - some methods (e.g. recordExists or getAutosavedValue) are missing from there. Those are imo safe to call from a regular (non-adaptive) crawler.

Was this intentional?

Comment on lines +174 to +177
contextPipelineBuilder: () =>
this.buildContextPipeline().compose({
action: async (context) => await this.parseContent(context),
}),
Contributor:

idea: what about replacing this with

override async buildContextPipeline() {
  return super.buildContextPipeline().compose({
    ...
  });
}

It feels slightly more OOP.

Also, then there might be no need for the contextPipelineBuilder constructor option and we'd only have the ...Enhancer method? I feel like having two such methods might be too much for the users (plus, imo a user will only want to extend the fully initialized context anyway).

janbuchar (Contributor Author):

I will extend this later with more detail, but the issue with a simple abstract method lies in correct generic types - when implementing CheerioCrawler for instance, you want to extend HttpCrawler<CheerioCrawlingContext>, but the inheritance "language" cannot express overriding HttpCrawler.buildContextPipeline (that returns a ContextPipeline<CrawlingContext, HttpCrawlingContext>) with CheerioCrawler.buildContextPipeline (that returns ContextPipeline<CrawlingContext, CheerioCrawlingContext>).

Oof. I hope that makes at least some sense. In any case, contextPipelineBuilder is not meant for your regular user; you shouldn't need it unless you're implementing a new crawler type. We should probably make that clearer; not sure why it isn't already.
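
A condensed sketch of the typing problem (all types and signatures simplified and assumed):

interface CrawlingContext { request: unknown }
interface HttpCrawlingContext extends CrawlingContext { body: string }
interface CheerioCrawlingContext extends HttpCrawlingContext { $: unknown }

declare class ContextPipeline<TBase, TFinal> {
  compose<TExt>(middleware: {
    action: (context: TFinal) => Promise<TExt>;
  }): ContextPipeline<TBase, TFinal & TExt>;
  run(base: TBase): Promise<TFinal>;
}

declare function buildHttpPipeline(): ContextPipeline<CrawlingContext, HttpCrawlingContext>;

abstract class BasicCrawler<Context extends CrawlingContext> {
  // Request handlers are typed against Context, so the pipeline must end in it.
  protected abstract buildContextPipeline(): ContextPipeline<CrawlingContext, Context>;
}

class HttpCrawler<Context extends HttpCrawlingContext = HttpCrawlingContext> extends BasicCrawler<Context> {
  protected buildContextPipeline() {
    // Type error: this pipeline ends in HttpCrawlingContext, but the inherited
    // signature demands one ending in the possibly narrower Context
    // (e.g. CheerioCrawlingContext when CheerioCrawler extends
    // HttpCrawler<CheerioCrawlingContext>).
    return buildHttpPipeline();
  }
}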

