refactor!: Introduce the ContextPipeline abstraction #3119
base: v4
Conversation
```ts
/** The main middleware function that enhances the context */
action: (context: TCrawlingContext) => Promise<TCrawlingContextExtension>;
/** Optional cleanup function called after the consumer finishes or fails */
cleanup?: (context: TCrawlingContext & TCrawlingContextExtension, error?: unknown) => Promise<void>;
```
Returning a cleanup callback from `action` may be a better approach. A benefit of that would be having access to the outer scope.
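A minimal sketch of that alternative, assuming a hypothetical middleware shape where `action` returns the extension together with its cleanup callback (`openPage` and the context shape are illustrative):

```ts
interface BaseContext { request: { url: string } }
declare function openPage(url: string): Promise<{ close(): Promise<void> }>;

const middleware = {
    action: async (context: BaseContext) => {
        const page = await openPage(context.request.url);
        return {
            extension: { page },
            // `page` is captured from action's scope, so the cleanup needs no
            // access to the enhanced context at all.
            cleanup: async () => page.close(),
        };
    },
};
```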
```ts
import { Router } from '../index.js';
import { parseContentTypeFromResponse } from './utils.js';

export type FileDownloadOptions<
```
I made quite substantial changes to the options here so that we don't need to accept `requestHandler` xor `streamHandler`. My motivation was that I thought it'd be difficult to shoehorn this into `ContextPipeline`, but it actually may be pretty easy.

So... is this worth it, or do we want the original interface?

Mainly for @barjin's eyes.
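For context, the original xor constraint can be expressed as a discriminated union; the types below are illustrative only, not the ones in this PR:

```ts
// Exactly one of the two handlers may be provided.
type FileDownloadHandlerOptions =
    | { requestHandler: (context: unknown) => Promise<void>; streamHandler?: never }
    | { streamHandler: (stream: unknown) => Promise<void>; requestHandler?: never };
```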
I think that with the changes from #3163, the `streamHandler` / `requestHandler` options might not be necessary anyway.

I'm not too happy about the current design anyway, so feel free to change the interface as you see fit 👍
Let's merge this quickly so we unblock all the other PRs 😄
A few ideas to get the discussion started:
```ts
export class ContextPipelineInitializationError extends Error {
    constructor(
        public error: unknown,
```
Perhaps you'd want to use the `Error` `{ cause }` field (MDN)? It seems more idiomatic that way.
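For illustration, using the standard ES2022 `cause` option (the message string here is made up):

```ts
export class ContextPipelineInitializationError extends Error {
    constructor(cause: unknown) {
        // The original error is preserved on `error.cause` and included in
        // Node's default error output.
        super('Context pipeline initialization failed', { cause });
    }
}
```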
Great idea, thanks!
```ts
export class ContextPipelineInitializationError extends Error {
    constructor(
        public error: unknown,
        public crawlingContext: {},
```
Logging the error instance will also log the entire attached crawling context to the console. This can be quite heavy - see all the stuff we're attaching to the context, e.g. in the browser crawlers.

I see the context being useful for debugging as well, so this is rather a question - are you aware / sure about this?
I'm aware of that, it actually bit me earlier 😁 After some consideration, we probably don't need the context here. The Python version does, but here we can make do with a simple cast.
```ts
*/
let numberOfRotations = -1;
const failedRequestHandler = vitest.fn();
const got = new GotScrapingHttpClient();
```
afaiac we want to sunset `got-scraping` in favour of impit. Can you please use `ImpitHttpClient` here?
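The swap would presumably be a one-liner, assuming `ImpitHttpClient` comes from `@crawlee/impit-client`:

```ts
import { ImpitHttpClient } from '@crawlee/impit-client';

const httpClient = new ImpitHttpClient();
```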
```ts
) {
    super({
        ...options,
        contextPipelineBuilder: () => this.buildContextPipeline(),
```
Is this a necessary step when subclassing `Crawler` instances now? Shouldn't this be the default behaviour anyway?
Only `BrowserCrawler`, and that one is special in many ways...
```ts
...options,
contextPipelineBuilder: () =>
    this.buildContextPipeline().compose({
        action: async (context) => this.parseContent(context),
```
nit: maybe making two methods (`parseContent`, `addHelpers`) would help with dogfooding the `ContextPipeline` design? The current `parseContent` method does more than just parse content.
Just to be sure, you mean that we should split the `parseContent` middleware into two separate steps, right? Sounds good to me...
Yes, that was the idea 👍
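The split might then look roughly like this (`addHelpers` is the hypothetical second step discussed above, not an existing method):

```ts
contextPipelineBuilder: () =>
    this.buildContextPipeline()
        // Step 1: only parse the response body.
        .compose({ action: async (context) => this.parseContent(context) })
        // Step 2: attach the request-handler helpers separately.
        .compose({ action: async (context) => this.addHelpers(context) }),
```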
```ts
};
}
async waitForSelector(selector: string, timeoutMs = 5_000) {
    const $ = cheerio.load(this.body);
```
Are we sure that `this` in the context extensions will always be bound to the context? I'd rather reference the `crawlingContext` passed as the param to the outer function, just to be sure.
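That is, something like this sketch, where the helper closes over the `crawlingContext` parameter instead of relying on `this` (`addHelpers` and the context shape are illustrative):

```ts
import * as cheerio from 'cheerio';

// `crawlingContext` is captured by the closure, so the helper keeps working
// even if it is detached from the context object (`this` stays irrelevant).
const addHelpers = (crawlingContext: { body: string }) => ({
    async waitForSelector(selector: string, timeoutMs = 5_000) {
        const $ = cheerio.load(crawlingContext.body);
        // ... poll `$` for `selector` until `timeoutMs` elapses
    },
});
```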
```ts
/**
 * Get a key-value store with given name or id, or the default one for the crawler.
 */
getKeyValueStore: (idOrName?: string) => Promise<KeyValueStore>;
```
The `RestrictedContext`'s `getKeyValueStore` returns a "limited" KVS type - some methods (e.g. `recordExists` or `getAutosavedValue`) are missing from there. Those are imo safe to call from a regular (non-adaptive) crawler.

Was this intentional?
```ts
contextPipelineBuilder: () =>
    this.buildContextPipeline().compose({
        action: async (context) => await this.parseContent(context),
    }),
```
idea: what about replacing this with
```ts
override async buildContextPipeline() {
    return super.buildContextPipeline().compose({
        // ...
    });
}
```
It feels slightly more OOP.

Also, then there might be no need for the `contextPipelineBuilder` constructor option and we'd only have the `...Enhancer` method? I feel like having two such methods might be too much for the users (plus, imo a user will only want to extend the fully initialized context anyway).
I will extend this later with more detail, but the issue with a simple abstract method lies in correct generic types - when implementing `CheerioCrawler`, for instance, you want to extend `HttpCrawler<CheerioCrawlingContext>`, but the inheritance "language" cannot express overriding `HttpCrawler.buildContextPipeline` (which returns a `ContextPipeline<CrawlingContext, HttpCrawlingContext>`) with `CheerioCrawler.buildContextPipeline` (which returns a `ContextPipeline<CrawlingContext, CheerioCrawlingContext>`).

Oof. I hope that makes at least some sense. In any case, `contextPipelineBuilder` is not meant for your regular user - you shouldn't need it unless you're implementing a new crawler type. We should probably make that clearer, not sure how.
`ContextPipeline` from the Python counterpart: #2479

Plan

In my opinion, it makes a lot of sense to do the remaining changes in a separate PR.

- Introduce the `ContextPipeline` abstraction
- Revisit the `ContextPipeline.compose` signature and the semantics of `BasicCrawlerOptions.contextPipelineEnhancer` to maximize DX
- `contextPipelineEnhancer`
Intent
The `context-pipeline` branch introduces a fundamental architectural change to how Crawlee crawlers build and enhance the crawling context passed to request handlers. The core motivation is to fix the composition and extensibility nightmare in the current crawler hierarchy.

The Problem
- Rigid inheritance hierarchy: Crawlers were stuck in a brittle inheritance chain where each layer manipulated the context object while assuming that it already satisfied its final type. Multiple overrides of `BasicCrawler` lifecycle methods made the execution flow even harder to follow.
- Context enhancement via monkey-patching: Manual property assignment (`crawlingContext.page = page`, `crawlingContext.$ = $`) was scattered everywhere. It was a mess to follow and impossible to reason about.
- Cleanup coordination: Resource cleanup was handled by separate `_cleanupContext` methods that were not co-located with the initialization.
- Broken extension mechanism: The `CrawlerExtension.use()` API tried to let you extend crawlers (the ones based on `HttpCrawler`) by overwriting properties - completely type-unsafe and fragile as hell.

The Solution
Introduces `ContextPipeline` - a middleware-based composition pattern where:

- context enhancements are expressed as `action` functions
- resource teardown is expressed as co-located `cleanup` functions

Key Design Decisions
1. Middleware Pattern
Declarative middleware composition with co-located cleanup:
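Something along these lines - a minimal sketch based on the middleware interface quoted earlier in this PR; `openPage` and the context shapes are illustrative, not the real implementation:

```ts
interface BaseContext { request: { url: string } }
interface PageExtension { page: { close(): Promise<void> } }

// Illustrative stand-in for whatever acquires the resource.
declare function openPage(url: string): Promise<{ close(): Promise<void> }>;

const pageMiddleware = {
    // The action enhances the context with new properties...
    action: async (context: BaseContext): Promise<PageExtension> => ({
        page: await openPage(context.request.url),
    }),
    // ...and the cleanup that releases them lives right next to it; it runs
    // after the consumer finishes or fails.
    cleanup: async (context: BaseContext & PageExtension, error?: unknown) => {
        await context.page.close();
    },
};
```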
2. Type-Safe Context Building
The `ContextPipeline<TBase, TFinal>` tracks type transformations through the chain:
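A sketch of that bookkeeping; the `compose` signature below is inferred from the middleware interface quoted earlier, not copied from the implementation:

```ts
declare class ContextPipeline<TBase, TFinal> {
    compose<TExt>(middleware: {
        action: (context: TFinal) => Promise<TExt>;
        cleanup?: (context: TFinal & TExt, error?: unknown) => Promise<void>;
    }): ContextPipeline<TBase, TFinal & TExt>;
}

interface CrawlingContext { request: { url: string } }
declare const base: ContextPipeline<CrawlingContext, CrawlingContext>;
declare function parseHtml(html: string): unknown; // illustrative parser

// ContextPipeline<CrawlingContext, CrawlingContext & { body: string }>
const withBody = base.compose({
    action: async (context) => ({ body: await (await fetch(context.request.url)).text() }),
});

// The final type keeps widening with each step:
// ContextPipeline<CrawlingContext, CrawlingContext & { body: string } & { $: unknown }>
const withCheerio = withBody.compose({
    action: async (context) => ({ $: parseHtml(context.body) }),
});
```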
3. New Extension Mechanism
The `CrawlerExtension.use()` API is gone. New approach via `contextPipelineEnhancer`:
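A hypothetical usage sketch against this branch - the exact option and callback signatures are assumptions:

```ts
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    // Enhance the fully built context with a custom helper instead of
    // overwriting crawler properties like CrawlerExtension.use() did.
    contextPipelineEnhancer: (pipeline) =>
        pipeline.compose({
            action: async (context) => ({
                logUrl: () => console.log(context.request.url),
            }),
        }),
    requestHandler: async (context) => {
        // The helper added by the enhancer is available on the context
        // (the cast is only because these types are assumed here).
        (context as any).logUrl();
    },
});
```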
Discussion Topics
1. The API
The current way to express a context pipeline middleware has some shortcomings (`ContextPipeline.compose`, `BasicCrawlerOptions.contextPipelineEnhancer`). I suggest resolving this in another PR.

2. Migration Path
For most legitimate use cases, this should be non-breaking. Those who extend the crawler classes in non-trivial ways may need to adjust their code, though - the non-public interfaces of `BasicCrawler` and `HttpCrawler` changed quite a bit.

3. Performance
The pipeline uses `Object.defineProperties` for each middleware. Is this a serious performance consideration?
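For reference, attaching one middleware's extension presumably boils down to a single call like this per request per middleware - cheap, but not free on hot paths:

```ts
declare const context: object;   // the live crawling context
declare const extension: object; // the object returned by a middleware's action

// Copies the extension's property descriptors (including getters) onto the
// context in one call.
Object.defineProperties(context, Object.getOwnPropertyDescriptors(extension));
```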