Middlewares
Frontier Middleware sits between the FrontierManager and Backend objects, processing Request and Response objects as they pass through the frontier data flow.
Middlewares are a lightweight, low-level system for filtering and altering the frontier's requests and responses.
Activating a middleware
To activate a Middleware component, add it to the MIDDLEWARES setting, which is a list whose entries are either class paths or Middleware objects.
Here is an example:
MIDDLEWARES = [
'frontera.contrib.middlewares.domain.DomainMiddleware',
]
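Middlewares are called in the same order they are defined in the list, so arrange them according to your needs. The order matters because each middleware performs a different action, and your middleware may depend on some previous (or subsequent) middleware having run.
Finally, keep in mind that some middlewares need to be enabled through a particular setting. See each middleware's documentation for details.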
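As an illustration of why ordering matters, the sketch below lists the three built-in middlewares described later on this page. It assumes that DomainFingerprintMiddleware needs the domain field that DomainMiddleware creates, which the reference section suggests (the fingerprint is stored inside the domain field) but does not state outright.

```python
# Ordering sketch for a settings module; middlewares run top to bottom.
MIDDLEWARES = [
    # Adds meta['domain'] to every request and response.
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    # Adds meta['fingerprint'], computed from the URL.
    'frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
    # Assumed to extend meta['domain'], so it is placed after DomainMiddleware.
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware',
]
```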
Writing your own middleware
Writing your own Frontera middleware is easy. Each Middleware is a Python class that subclasses Component.
The FrontierManager communicates with all active middlewares through the methods described below.
class frontera.core.components.Middleware
Methods
Middleware.frontier_start()
    Called when the frontier starts, see starting/stopping the frontier.
Middleware.frontier_stop()
    Called when the frontier stops, see starting/stopping the frontier.
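Middleware.add_seeds(seeds)
    This method is called when new seeds are added to the frontier.
    Parameters: seeds (list) – A list of Request objects.
    Returns: a list of Request objects, or None.
    The method should return either None or a list of Request objects. If it returns None, the FrontierManager will not call any further middleware and the seeds will never reach the Backend. If it returns a list of Request objects, that list is passed to the next middleware. This is repeated for every active middleware until the list reaches the Backend. To filter out a seed, leave it out of the returned list.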
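The chaining rule for add_seeds can be sketched without Frontera installed. Everything below is a stand-in written for illustration: the Request class, HttpOnlySeedsMiddleware, and run_add_seeds are hypothetical names, and run_add_seeds only models the behavior described above rather than reproducing the FrontierManager. A real middleware would subclass frontera.core.components.Middleware.

```python
class Request:
    """Stand-in for Frontera's Request; only the fields used here."""

    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


class HttpOnlySeedsMiddleware:
    """Hypothetical middleware that keeps only http/https seeds."""

    def add_seeds(self, seeds):
        # Filtering a seed means leaving it out of the returned list.
        return [s for s in seeds if s.url.startswith(("http://", "https://"))]


class DropAllSeedsMiddleware:
    """Hypothetical middleware that stops the chain by returning None."""

    def add_seeds(self, seeds):
        return None


def run_add_seeds(middlewares, seeds):
    """Model of the documented chaining: stop at the first None."""
    for middleware in middlewares:
        seeds = middleware.add_seeds(seeds)
        if seeds is None:
            return None  # the Backend would never see these seeds
    return seeds  # this list would be handed to the Backend


seeds = [Request("http://example.com/"), Request("ftp://example.com/file")]
kept = run_add_seeds([HttpOnlySeedsMiddleware()], seeds)
print([r.url for r in kept])  # ['http://example.com/']
```

Placing DropAllSeedsMiddleware anywhere in the list makes run_add_seeds return None, which mirrors the case where the seeds never reach the Backend.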
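Middleware.page_crawled(response)
    This method is called every time a page has been crawled.
    Parameters: response (object) – The Response object for the crawled page.
    Returns: a Response object, or None.
    The method should return either None or a Response object. If it returns None, the FrontierManager will not call any further middleware and the Backend will not be notified. If it returns a Response object, that object is passed to the next middleware. This is repeated for every active middleware until the response reaches the Backend. To filter out a page, simply return None.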
Middleware.request_error(page, error)
    This method is called each time an error occurs when crawling a page.
    Parameters:
    - page (object) – The Request object whose crawl failed.
    - error (string) – A string identifier for the error.
    Returns: a Request object, or None.
    The method should return either None or a Request object. If it returns None, the FrontierManager will not communicate with any further middleware and the Backend will not be notified. If it returns a Request object, that object is passed to the next middleware. This is repeated for every active middleware until the request reaches the Backend. To filter out a page error, simply return None.
Class Methods
Middleware.from_manager(manager)
    Class method called from FrontierManager, passing the manager itself.
    Example of usage:
    @classmethod
    def from_manager(cls, manager):
        return cls(settings=manager.settings)
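The following sketch shows how from_manager might be combined with request_error. It is self-contained, so FakeSettings and FakeManager are stand-ins invented here, and both the MaxErrorsMiddleware class and its MAX_ERRORS setting are hypothetical; only the from_manager signature and the page/error arguments come from the reference above.

```python
class FakeSettings:
    """Stand-in for the frontier settings object (hypothetical)."""

    def __init__(self, **values):
        self._values = values

    def get(self, name, default=None):
        return self._values.get(name, default)


class FakeManager:
    """Stand-in for FrontierManager; exposes only .settings."""

    def __init__(self, settings):
        self.settings = settings


class MaxErrorsMiddleware:
    """Hypothetical middleware that forwards only the first N errors."""

    def __init__(self, settings):
        # MAX_ERRORS is a made-up setting name used for illustration.
        self.max_errors = settings.get("MAX_ERRORS", 3)
        self.errors = 0

    @classmethod
    def from_manager(cls, manager):
        # Build the middleware from the manager's settings.
        return cls(settings=manager.settings)

    def request_error(self, page, error):
        self.errors += 1
        # Forward the request while under budget; return None to filter it.
        return page if self.errors <= self.max_errors else None


manager = FakeManager(FakeSettings(MAX_ERRORS=1))
middleware = MaxErrorsMiddleware.from_manager(manager)
```

Reading configuration in from_manager keeps the constructor free of any dependency on the manager, so the class can also be instantiated directly in tests.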
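Built-in middleware reference
This page describes all the Middleware components that come with Frontera. For information on how to use them and how to write your own middleware, see the middleware usage guide.
For a list of the components enabled by default (and their order), see the MIDDLEWARES setting.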
DomainMiddleware
class frontera.contrib.middlewares.domain.DomainMiddleware
    When activated, this Middleware adds a domain info field to every Request.meta and Response.meta.
    The domain object contains the following fields, with both keys and values as bytes:
    - netloc: URL netloc according to RFC 1808 syntax specifications
    - name: Domain name
    - scheme: URL scheme
    - tld: Top level domain
    - sld: Second level domain
    - subdomain: URL subdomain(s)
    An example for a Request object:
    >>> request.url
    'http://www.scrapinghub.com:8080/this/is/an/url'
    >>> request.meta['domain']
    {
        "name": "scrapinghub.com",
        "netloc": "www.scrapinghub.com",
        "scheme": "http",
        "sld": "scrapinghub",
        "subdomain": "www",
        "tld": "com"
    }
    If TEST_MODE is active, it also accepts testing URLs, parsing letter domains:
    >>> request.url
    'A1'
    >>> request.meta['domain']
    {
        "name": "A",
        "netloc": "A",
        "scheme": "-",
        "sld": "-",
        "subdomain": "-",
        "tld": "-"
    }
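To make the meaning of each field concrete, here is a deliberately naive sketch that derives the same fields with the standard library. It is not how DomainMiddleware is implemented: naive_domain_info is a hypothetical helper, it returns str rather than bytes, and it treats the last host label as the TLD, which breaks for multi-label suffixes such as co.uk.

```python
from urllib.parse import urlparse


def naive_domain_info(url):
    """Split a URL into the fields listed above (illustration only)."""
    parts = urlparse(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    if len(labels) >= 2:
        tld, sld = labels[-1], labels[-2]
        name = sld + "." + tld
    else:
        # Single-label hosts such as "localhost" have no tld/sld split.
        tld, sld, name = "", "", host
    return {
        "netloc": host,  # the port is omitted, matching the example above
        "name": name,
        "scheme": parts.scheme,
        "tld": tld,
        "sld": sld,
        "subdomain": ".".join(labels[:-2]),
    }


info = naive_domain_info("http://www.scrapinghub.com:8080/this/is/an/url")
print(info["name"], info["subdomain"])  # scrapinghub.com www
```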
UrlFingerprintMiddleware
class frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware
    When activated, this Middleware adds a fingerprint field to every Request.meta and Response.meta.
    The fingerprint is calculated from the object's URL, using the function defined in the URL_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
    An example for a Request object:
    >>> request.url
    'http://www.scrapinghub.com:8080'
    >>> request.meta['fingerprint']
    '60d846bc2969e9706829d5f1690f11dafb70ed18'
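A custom fingerprint function could look like the sketch below. The function name sha256_fingerprint and the module path myproject.fingerprints are hypothetical, and the example assumes the setting accepts a dotted path to the function; check the settings reference before relying on that.

```python
import hashlib


def sha256_fingerprint(key):
    """Return the SHA-256 hex digest of a URL as bytes."""
    data = key.encode("utf-8") if isinstance(key, str) else key
    # hexdigest() returns str, so encode it to satisfy the bytes requirement.
    return hashlib.sha256(data).hexdigest().encode("ascii")


# Assumption: the setting takes a dotted path; the module name is made up.
URL_FINGERPRINT_FUNCTION = 'myproject.fingerprints.sha256_fingerprint'
```

Accepting both str and bytes keeps the function usable regardless of how the URL is represented by the caller.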
frontera.utils.fingerprint.hostname_local_fingerprint(key)
    This function is used for URL fingerprinting, which serves to uniquely identify the document in storage.
    hostname_local_fingerprint builds the fingerprint from two parts: the first 4 bytes are the CRC32 of the host, and the remainder is the MD5 of the rest of the URL. This default is chosen to make use of the HBase block cache: all the documents of an average website are expected to fit within one cache block, which can be read efficiently from disk in a single pass.
    Parameters: key – str URL
    Returns: str 20 bytes hex string
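The layout described above (4 bytes of host CRC32 followed by 16 bytes of MD5) can be sketched as follows. This is an independent illustration, not Frontera's code: host_prefixed_fingerprint is a hypothetical name, and the exact way the non-host part of the URL is assembled before hashing is an assumption.

```python
import hashlib
import struct
import zlib
from binascii import hexlify
from urllib.parse import urlparse


def host_prefixed_fingerprint(url):
    """Return 20 bytes (hex-encoded) of CRC32(host) followed by MD5(rest)."""
    parts = urlparse(url)
    host = parts.hostname or "-"
    # Mask to an unsigned 32-bit value so it packs as 4 bytes.
    host_crc = zlib.crc32(host.encode("utf-8")) & 0xFFFFFFFF
    # Assumption: "rest of the URL" means path, query and fragment.
    rest = parts.path + "?" + parts.query + "#" + parts.fragment
    digest = hashlib.md5(rest.encode("utf-8")).digest()  # 16 bytes
    return hexlify(struct.pack(">I16s", host_crc, digest))


a = host_prefixed_fingerprint("http://example.com/a")
b = host_prefixed_fingerprint("http://example.com/b")
print(a[:8] == b[:8])  # True: same host, same 4-byte prefix
```

Because every URL on a host shares the same prefix, their fingerprints sort next to each other, which is what lets a key-ordered store keep one site's documents close together.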
DomainFingerprintMiddleware
class frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware
    When activated, this Middleware adds a fingerprint field to the domain field of every Request.meta and Response.meta.
    The fingerprint is calculated from the object's URL, using the function defined in the DOMAIN_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
    An example for a Request object:
    >>> request.url
    'http://www.scrapinghub.com:8080'
    >>> request.meta['domain']
    {
        "fingerprint": "5bab61eb53176449e25c2c82f172b82cb13ffb9d",
        "name": "scrapinghub.com",
        "netloc": "www.scrapinghub.com",
        "scheme": "http",
        "sld": "scrapinghub",
        "subdomain": "www",
        "tld": "com"
    }