Middlewares
Frontier Middleware sits between the FrontierManager and Backend objects, processing Request and Response objects according to the frontier data flow.
Middlewares are a light, low-level system for filtering and altering the Frontier's requests and responses.
Activating a middleware
To activate a Middleware component, add it to the MIDDLEWARES setting, which is a list whose values can be class paths or Middleware object instances.
Here is an example:
MIDDLEWARES = [
    'frontera.contrib.middlewares.domain.DomainMiddleware',
]
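Middlewares are called in the same order in which they are defined in the list, so arrange them according to your needs. The order matters because each middleware performs a different action, and your middleware may depend on a middleware that runs before (or after) it.
Finally, keep in mind that some middlewares need to be enabled through a particular setting. See each middleware's documentation for details.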
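To make the ordering point concrete, the fragment below lists three of the built-in middlewares covered in the reference section of this page. Placing DomainMiddleware first is an assumption drawn from that reference: DomainFingerprintMiddleware stores its fingerprint inside the domain field, so it presumably needs that field to exist already. This ordering is illustrative and is not necessarily the shipped default.

```python
# settings.py (illustrative ordering, not necessarily the default)
MIDDLEWARES = [
    # Adds meta['domain']; listed first so later middlewares can read it.
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    # Adds meta['fingerprint'], computed from the URL.
    'frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
    # Adds a fingerprint inside meta['domain'] (assumed to rely on DomainMiddleware).
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware',
]
```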
Writing your own middleware
Writing your own Frontera middleware is easy. Each Middleware is a Python class that inherits from Component.
FrontierManager communicates with all active middlewares through the methods described below.
class frontera.core.components.Middleware
Methods
Middleware.frontier_start()
    Called when the frontier starts, see starting/stopping the frontier.
Middleware.frontier_stop()
    Called when the frontier stops, see starting/stopping the frontier.
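Middleware.add_seeds(seeds)
    This method is called when new seeds are added to the frontier.
    Parameters: seeds (list) – A list of Request objects.
    Returns: Request object list or None
    Should return either None or a list of Request objects.
    If it returns None, FrontierManager will not continue processing any other middleware and the seeds will never reach the Backend.
    If it returns a list of Request objects, the list is passed to the next middleware. This process repeats for every active middleware until the result reaches the Backend.
    To filter out a seed, simply leave it out of the returned list.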
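Middleware.page_crawled(response)
    This method is called every time a page has been crawled.
    Parameters: response (object) – The Response object for the crawled page.
    Returns: Response or None
    Should return either None or a Response object.
    If it returns None, FrontierManager will not continue processing any other middleware and the Backend will never be notified.
    If it returns a Response object, it is passed to the next middleware. This process repeats for every active middleware until the result reaches the Backend.
    To filter out a page, simply return None.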
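Middleware.request_error(page, error)
    This method is called each time an error occurs when crawling a page.
    Parameters:
        - request (object) – The Request object whose crawl failed (this is the page argument in the signature above).
        - error (string) – A string identifier for the error.
    Returns: Request or None
    Should return either None or a Request object.
    If it returns None, FrontierManager will not communicate with any other middleware and the Backend will never be notified.
    If it returns a Request object, it is passed to the next middleware. This process repeats for every active middleware until the result reaches the Backend.
    To filter out a page error, simply return None.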
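The return-value contract above can be sketched in plain Python. Everything in this block is a stand-in written for illustration: the Request class, HttpsOnlyMiddleware, TagMiddleware and run_add_seeds are hypothetical names, not part of Frontera, and a real middleware would inherit from frontera.core.components.Middleware instead. The sketch only models the documented behaviour: middlewares run in list order, a filtered seed is left out of the returned list, and None stops the chain.

```python
class Request:
    """Stand-in for Frontera's Request; only url and meta are modelled."""
    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


class HttpsOnlyMiddleware:
    """Hypothetical middleware that drops seeds that are not HTTPS."""

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def add_seeds(self, seeds):
        # Filtering a seed means leaving it out of the returned list.
        return [s for s in seeds if s.url.startswith('https://')]

    def page_crawled(self, response):
        # Pass the response through unchanged.
        return response

    def request_error(self, page, error):
        # Returning None stops the chain, so the backend is not notified.
        # 'ignored-error' is a made-up error identifier.
        return None if error == 'ignored-error' else page


class TagMiddleware:
    """Hypothetical middleware that marks every seed it sees."""

    def add_seeds(self, seeds):
        for s in seeds:
            s.meta['tagged'] = True
        return seeds


def run_add_seeds(middlewares, seeds):
    """Mimic the documented flow: call each middleware in order, stop on None."""
    result = seeds
    for mw in middlewares:
        result = mw.add_seeds(result)
        if result is None:
            return None  # nothing reaches the backend
    return result


seeds = [Request('https://example.com/'), Request('http://example.org/')]
kept = run_add_seeds([HttpsOnlyMiddleware(), TagMiddleware()], seeds)
```

Because HttpsOnlyMiddleware runs first, TagMiddleware only ever sees the HTTPS seed; swapping the two would tag both seeds before one of them is filtered out.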
Class Methods
Middleware.from_manager(manager)
    Class method called from FrontierManager, passing the manager itself.
    Example of usage:

        @classmethod
        def from_manager(cls, manager):
            # Build the middleware from the manager's settings.
            return cls(settings=manager.settings)
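As a rough illustration of why from_manager is useful, the sketch below builds a middleware from a manager's settings. Settings and FakeManager are stand-ins so the example runs without Frontera, MaxDepthMiddleware is a hypothetical class, and MAX_DEPTH is a made-up setting name; only the manager.settings attribute comes from the usage example above.

```python
class Settings:
    """Stand-in for the manager's settings object (hypothetical)."""
    def __init__(self, **values):
        self._values = values

    def get(self, key, default=None):
        return self._values.get(key, default)


class FakeManager:
    """Stand-in for FrontierManager; only the settings attribute is modelled."""
    def __init__(self, settings):
        self.settings = settings


class MaxDepthMiddleware:
    """Hypothetical middleware configured through a made-up MAX_DEPTH setting."""

    def __init__(self, settings):
        self.max_depth = settings.get('MAX_DEPTH', 3)

    @classmethod
    def from_manager(cls, manager):
        # The manager hands itself over; the middleware picks what it needs.
        return cls(settings=manager.settings)


mw = MaxDepthMiddleware.from_manager(FakeManager(Settings(MAX_DEPTH=5)))
default_mw = MaxDepthMiddleware.from_manager(FakeManager(Settings()))
```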
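Built-in middleware reference
This page describes all the Middleware components that come with Frontera. For information on how to use them and how to write your own middleware, see the middleware usage guide.
For a list of the components enabled by default (and their order), see the MIDDLEWARES setting.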
DomainMiddleware
class frontera.contrib.middlewares.domain.DomainMiddleware
    This Middleware adds a domain info field to every Request.meta and Response.meta if it is activated.
    The domain object contains the following fields, with both keys and values as bytes:
    - netloc: URL netloc according to RFC 1808 syntax specifications
    - name: Domain name
    - scheme: URL scheme
    - tld: Top level domain
    - sld: Second level domain
    - subdomain: URL subdomain(s)
    An example for a Request object:

        >>> request.url
        'http://www.scrapinghub.com:8080/this/is/an/url'
        >>> request.meta['domain']
        {
            "name": "scrapinghub.com",
            "netloc": "www.scrapinghub.com",
            "scheme": "http",
            "sld": "scrapinghub",
            "subdomain": "www",
            "tld": "com"
        }

    If TEST_MODE is active, it also accepts testing URLs, parsing letter domains:

        >>> request.url
        'A1'
        >>> request.meta['domain']
        {
            "name": "A",
            "netloc": "A",
            "scheme": "-",
            "sld": "-",
            "subdomain": "-",
            "tld": "-"
        }
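The field breakdown above can be approximated with the standard library. This is a minimal sketch, not DomainMiddleware's actual implementation: parse_domain_naive is a hypothetical helper that treats the last label as the TLD and the one before it as the SLD, which is wrong for suffixes such as co.uk (a real implementation would need a public suffix list). It also returns str values rather than bytes, and uses the hostname without the port for netloc to match the example output above.

```python
from urllib.parse import urlparse


def parse_domain_naive(url):
    """Split a URL into domain fields using a naive last-two-labels rule."""
    parts = urlparse(url)
    host = parts.hostname or ''
    labels = host.split('.')
    if len(labels) < 2:
        # Not enough labels to tell tld and sld apart; use "-" placeholders.
        return {'netloc': host, 'name': host, 'scheme': parts.scheme or '-',
                'tld': '-', 'sld': '-', 'subdomain': '-'}
    tld, sld = labels[-1], labels[-2]
    return {
        'netloc': host,                      # hostname only, port dropped
        'name': sld + '.' + tld,
        'scheme': parts.scheme,
        'tld': tld,
        'sld': sld,
        'subdomain': '.'.join(labels[:-2]),  # empty string if there is none
    }


domain = parse_domain_naive('http://www.scrapinghub.com:8080/this/is/an/url')
```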
UrlFingerprintMiddleware
class frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware
    This Middleware adds a fingerprint field to every Request.meta and Response.meta if it is activated.
    The fingerprint is calculated from the object's URL, using the function defined in the URL_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
    An example for a Request object:

        >>> request.url
        'http://www.scrapinghub.com:8080'
        >>> request.meta['fingerprint']
        '60d846bc2969e9706829d5f1690f11dafb70ed18'
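A custom fingerprint function could look like the sketch below. The function name and the myproject.fingerprints module path are hypothetical, and the sketch assumes the setting takes a dotted path in the same way MIDDLEWARES does. It hashes the whitespace-stripped URL with SHA-1 and returns the hex digest as bytes, following the "must be bytes" requirement above.

```python
import hashlib


def stripped_sha1_fingerprint(url):
    """Hypothetical fingerprint: SHA-1 hex digest of the stripped URL, as bytes."""
    if isinstance(url, str):
        url = url.encode('utf-8')
    return hashlib.sha1(url.strip()).hexdigest().encode('ascii')


# In settings.py (the module path is an assumption for illustration):
URL_FINGERPRINT_FUNCTION = 'myproject.fingerprints.stripped_sha1_fingerprint'

fp = stripped_sha1_fingerprint(' http://example.com/ ')
```

Stripping whitespace is only one possible normalisation; whatever the function does, it should be deterministic so that the same document always maps to the same fingerprint.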
frontera.utils.fingerprint.hostname_local_fingerprint(key)
    This function is used for URL fingerprinting, which serves to uniquely identify the document in storage.
    hostname_local_fingerprint builds the fingerprint by taking the first 4 bytes as the CRC32 of the host and the remaining bytes as the MD5 of the rest of the URL. This default is chosen to make use of the HBase block cache: all the documents of an average website are expected to fit within one cache block, which can be read from disk efficiently in a single pass.
    Parameters: key – str URL
    Returns: str, 20 bytes as a hex string
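The layout described above (4 bytes of host CRC32 followed by 16 bytes of MD5) can be sketched as follows. This is not the library's code: hostname_local_sketch is a hypothetical name, and exactly which parts of the URL feed the MD5 is an assumption (path and query here). The "20 bytes hex string" is read as 20 bytes encoded as 40 hex characters.

```python
import hashlib
import struct
import zlib
from binascii import hexlify
from urllib.parse import urlparse


def hostname_local_sketch(url):
    """Return a 40-character hex string: CRC32(host) followed by MD5(rest of URL)."""
    parts = urlparse(url)
    host = (parts.hostname or '').encode('utf-8')
    # First 4 bytes: CRC32 of the host, masked to an unsigned 32-bit value.
    host_crc = zlib.crc32(host) & 0xffffffff
    # Remaining 16 bytes: MD5 of the non-host part (assumed: path plus query).
    rest = (parts.path + '?' + parts.query).encode('utf-8')
    doc_md5 = hashlib.md5(rest).digest()
    # Big-endian: 4-byte unsigned int, then the 16-byte digest.
    return hexlify(struct.pack('>I16s', host_crc, doc_md5)).decode('ascii')


a = hostname_local_sketch('http://example.com/a')
b = hostname_local_sketch('http://example.com/b')
c = hostname_local_sketch('http://example.org/a')
```

URLs from the same host share the first 8 hex characters, so their fingerprints sort next to each other; that locality is what lets one site's documents land in the same storage block.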
DomainFingerprintMiddleware
class frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware
    This Middleware adds a fingerprint field to the domain field of every Request.meta and Response.meta if it is activated.
    The fingerprint is calculated from the object's URL, using the function defined in the DOMAIN_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
    An example for a Request object:

        >>> request.url
        'http://www.scrapinghub.com:8080'
        >>> request.meta['domain']
        {
            "fingerprint": "5bab61eb53176449e25c2c82f172b82cb13ffb9d",
            "name": "scrapinghub.com",
            "netloc": "www.scrapinghub.com",
            "scheme": "http",
            "sld": "scrapinghub",
            "subdomain": "www",
            "tld": "com"
        }