Middlewares

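Frontier Middleware sits between the FrontierManager and Backend objects and processes Request and Response objects as they pass through, following the frontier data flow.

Middlewares are a light, low-level system for filtering and altering the Frontier's requests and responses.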

Activating a middleware

To activate a Middleware component, add it to the MIDDLEWARES setting, which is a list whose items are either class paths or Middleware objects.

Here's an example:

MIDDLEWARES = [
    'frontera.contrib.middlewares.domain.DomainMiddleware',
]

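Middlewares are called in the same order in which they are defined in the list, so arrange them according to your needs. The order matters because each middleware performs a different action, and your middleware may depend on some previous (or subsequent) middleware being applied.

Finally, keep in mind that some middlewares need to be configured through a particular setting. See each middleware's documentation for details.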
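As a sketch of how ordering might look in a settings module, the list below enables the three built-in middlewares described later in this document. The ordering is an assumption, based on DomainFingerprintMiddleware writing into the domain field that DomainMiddleware creates; it is not taken from the documented defaults.

```python
# settings.py (illustrative sketch; the ordering is an assumption, not the documented default)
MIDDLEWARES = [
    # Adds the 'domain' field to Request.meta and Response.meta.
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    # Adds the 'fingerprint' field computed from the URL.
    'frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware',
    # Adds a fingerprint inside the 'domain' field, so it is listed after DomainMiddleware.
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware',
]

# Example of a middleware-specific setting: the function used by UrlFingerprintMiddleware.
URL_FINGERPRINT_FUNCTION = 'frontera.utils.fingerprint.hostname_local_fingerprint'
```

Listing the domain middleware first means any later middleware that reads the domain field should find it already populated.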

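Writing your own middleware

Writing your own Frontera middleware is easy. Each Middleware is a single Python class that inherits from Component.

FrontierManager communicates with all active middlewares through the methods described below.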

class frontera.core.components.Middleware

Methods

Middleware.frontier_start()

Called when the frontier starts, see starting/stopping the frontier.

Middleware.frontier_stop()

Called when the frontier stops, see starting/stopping the frontier.

Middleware.add_seeds(seeds)

This method is called when new seeds are added to the frontier.

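Parameters: seeds (list) – A list of Request objects.
Returns: Request object list or None

Should return either None or a list of Request objects.

If it returns None, FrontierManager won't process any further middleware, and the seeds will never reach the Backend.

If it returns a list of Request objects, the list is passed to the next middleware. This process repeats for every active middleware until the list reaches the Backend.

To filter out a seed, simply leave it out of the returned object list.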
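The filtering rule for seeds can be illustrated with a small sketch. The Request class here is a hypothetical stand-in so the example runs without Frontera installed, and HttpOnlySeedsMiddleware is an invented name; a real middleware would subclass frontera.core.components.Middleware and implement the remaining methods too.

```python
class Request:
    """Hypothetical stand-in for Frontera's Request; only url and meta are modelled."""

    def __init__(self, url, meta=None):
        self.url = url
        self.meta = meta if meta is not None else {}


class HttpOnlySeedsMiddleware:
    """Sketch of a middleware that drops seeds whose URL is not http(s)."""

    def add_seeds(self, seeds):
        # Filtering a seed means leaving it out of the returned list.
        return [seed for seed in seeds
                if seed.url.startswith(('http://', 'https://'))]


seeds = [Request('http://example.com/'), Request('ftp://example.com/file')]
kept = HttpOnlySeedsMiddleware().add_seeds(seeds)
print([seed.url for seed in kept])  # ['http://example.com/']
```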

Middleware.page_crawled(response)

This method is called every time a page has been crawled.

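Parameters: response (object) – The Response object for the crawled page.
Returns: Response or None

Should return either None or a Response object.

If it returns None, FrontierManager won't process any further middleware, and the Backend won't be notified.

If it returns a Response object, it is passed to the next middleware. This process repeats for every active middleware until the response reaches the Backend.

To filter out a page, simply return None.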
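To make the propagation rule concrete, here is a minimal sketch of how a manager could walk a middleware chain. It is not Frontera's actual implementation: run_page_crawled, Response, SkipErrorPages, MarkSeen, and RecordingBackend are all hypothetical stand-ins written for this example.

```python
class Response:
    """Hypothetical stand-in for Frontera's Response object."""

    def __init__(self, url, status_code=200):
        self.url = url
        self.status_code = status_code
        self.meta = {}


class SkipErrorPages:
    """Filters out responses with a 4xx/5xx status by returning None."""

    def page_crawled(self, response):
        return response if response.status_code < 400 else None


class MarkSeen:
    """Alters the response by writing a flag into its meta dict."""

    def page_crawled(self, response):
        response.meta['seen'] = True
        return response


class RecordingBackend:
    """Stand-in backend that records the URLs it is notified about."""

    def __init__(self):
        self.crawled = []

    def page_crawled(self, response):
        self.crawled.append(response.url)


def run_page_crawled(middlewares, backend, response):
    """Pass the response through each middleware in order; stop on None."""
    for middleware in middlewares:
        response = middleware.page_crawled(response)
        if response is None:
            # Filtered: later middlewares and the backend are not notified.
            return None
    backend.page_crawled(response)
    return response


backend = RecordingBackend()
chain = [SkipErrorPages(), MarkSeen()]
ok = Response('http://example.com/')
missing = Response('http://example.com/missing', status_code=404)
run_page_crawled(chain, backend, ok)
run_page_crawled(chain, backend, missing)
print(backend.crawled)  # ['http://example.com/']
```

The 404 response is dropped by the first middleware, so MarkSeen never touches it and the backend never hears about it.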

Middleware.request_error(page, error)

This method is called each time an error occurs when crawling a page.

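Parameters:
  • page (object) – The Request object whose crawl ended in an error.
  • error (string) – A string identifier for the error.
Returns: Request or None

Should return either None or a Request object.

If it returns None, FrontierManager won't communicate with any other middleware, and the Backend won't be notified.

If it returns a Request object, it is passed to the next middleware. This process repeats for every active middleware until the request reaches the Backend.

To filter out a page error, simply return None.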
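A hedged sketch of an error-filtering middleware follows. The Request class is a stand-in, IgnoreTimeoutsMiddleware is an invented name, and the error identifiers 'timeout' and 'dns' are hypothetical; the actual identifiers depend on the crawler that reports the errors.

```python
class Request:
    """Hypothetical stand-in for Frontera's Request; only url and meta are modelled."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


class IgnoreTimeoutsMiddleware:
    """Sketch: hide selected errors from later middlewares and the Backend."""

    # Hypothetical error identifiers, chosen only for illustration.
    IGNORED_ERRORS = {'timeout'}

    def request_error(self, page, error):
        if error in self.IGNORED_ERRORS:
            # Returning None filters the error out of the chain.
            return None
        # Otherwise annotate the request and let it continue down the chain.
        page.meta['last_error'] = error
        return page


middleware = IgnoreTimeoutsMiddleware()
timed_out = middleware.request_error(Request('http://example.com/slow'), 'timeout')
dns_failed = middleware.request_error(Request('http://example.invalid/'), 'dns')
print(timed_out)                      # None
print(dns_failed.meta['last_error'])  # dns
```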

Class Methods

Middleware.from_manager(manager)

Class method called from FrontierManager passing the manager itself.

Example of usage:

@classmethod
def from_manager(cls, manager):
    # Build the middleware from the manager's settings.
    return cls(settings=manager.settings)
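The following sketch shows one way a middleware might read its configuration through from_manager. MaxDepthMiddleware and the MAX_DEPTH setting are hypothetical, and a plain dict stands in for the manager's settings object, on the assumption that the real settings object offers a similar get(name, default) lookup.

```python
from types import SimpleNamespace


class MaxDepthMiddleware:
    """Sketch of a middleware configured through from_manager."""

    def __init__(self, settings):
        # MAX_DEPTH is a hypothetical setting name used only for this sketch.
        self.max_depth = settings.get('MAX_DEPTH', 3)

    @classmethod
    def from_manager(cls, manager):
        # The manager passes itself in; only its settings are needed here.
        return cls(settings=manager.settings)


# A plain dict stands in for the manager's settings object.
manager = SimpleNamespace(settings={'MAX_DEPTH': 5})
middleware = MaxDepthMiddleware.from_manager(manager)
print(middleware.max_depth)  # 5
```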

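Built-in middleware reference

This page describes all the Middleware components that come with Frontera. For information on how to use them and how to write your own middleware, see the middleware usage guide.

For a list of the components enabled by default (and their order), see the MIDDLEWARES setting.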

DomainMiddleware

class frontera.contrib.middlewares.domain.DomainMiddleware

When activated, this middleware adds a domain info field to every Request.meta and Response.meta.

The domain object contains the following fields, with both keys and values as bytes:

  • netloc: URL netloc according to RFC 1808 syntax specifications
  • name: Domain name
  • scheme: URL scheme
  • tld: Top level domain
  • sld: Second level domain
  • subdomain: URL subdomain(s)

An example for a Request object:

>>> request.url
'http://www.scrapinghub.com:8080/this/is/an/url'

>>> request.meta['domain']
{
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}

If TEST_MODE is active, it will also accept testing URLs, parsing letter domains:

>>> request.url
'A1'

>>> request.meta['domain']
{
    "name": "A",
    "netloc": "A",
    "scheme": "-",
    "sld": "-",
    "subdomain": "-",
    "tld": "-"
}
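The fields listed above can be approximated with the standard library, as in the sketch below. This is not how DomainMiddleware itself works: naive_domain_info is a hypothetical helper that simply treats the last label of the host as the TLD, which is wrong for multi-part suffixes such as co.uk. It also returns str values rather than bytes and reports the hostname without the port as netloc, to match the first example above.

```python
from urllib.parse import urlparse


def naive_domain_info(url):
    """Split a URL into fields resembling the 'domain' dict (naive suffix handling)."""
    parsed = urlparse(url)
    host = parsed.hostname or ''
    labels = host.split('.')
    if len(labels) < 2:
        # Not enough labels to split; treat the whole host as the name.
        return {'netloc': host, 'name': host, 'scheme': parsed.scheme,
                'tld': '', 'sld': host, 'subdomain': ''}
    tld, sld = labels[-1], labels[-2]
    return {
        'netloc': host,
        'name': sld + '.' + tld,
        'scheme': parsed.scheme,
        'tld': tld,
        'sld': sld,
        'subdomain': '.'.join(labels[:-2]),
    }


info = naive_domain_info('http://www.scrapinghub.com:8080/this/is/an/url')
print(info['name'], info['sld'], info['subdomain'], info['tld'])
# scrapinghub.com scrapinghub www com
```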

UrlFingerprintMiddleware

class frontera.contrib.middlewares.fingerprint.UrlFingerprintMiddleware

When activated, this middleware adds a fingerprint field to every Request.meta and Response.meta.

The fingerprint is calculated from the object's URL, using the function defined in the URL_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.
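As an illustration of a custom fingerprint function, the sketch below hashes the URL with its fragment removed, so that two URLs differing only in the fragment share a fingerprint. The function name and the module path myproject.fingerprints are hypothetical, and the example assumes the setting accepts a dotted import path in the same way that MIDDLEWARES accepts class paths.

```python
import hashlib
from urllib.parse import urldefrag


def fragmentless_sha1(url):
    """Hypothetical fingerprint: SHA-1 of the URL without its fragment, as bytes."""
    stripped = urldefrag(url).url
    digest = hashlib.sha1(stripped.encode('utf-8')).hexdigest()
    # The text above requires bytes, so encode the hex digest.
    return digest.encode('ascii')


# In the settings module, point the setting at the function's import path.
# 'myproject.fingerprints' is a hypothetical module name.
URL_FINGERPRINT_FUNCTION = 'myproject.fingerprints.fragmentless_sha1'

first = fragmentless_sha1('http://example.com/page#top')
second = fragmentless_sha1('http://example.com/page')
print(first == second)  # True
```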

An example for a Request object:

>>> request.url
'http://www.scrapinghub.com:8080'

>>> request.meta['fingerprint']
'60d846bc2969e9706829d5f1690f11dafb70ed18'

frontera.utils.fingerprint.hostname_local_fingerprint(key)

This function is used for URL fingerprinting, which serves to uniquely identify the document in storage. hostname_local_fingerprint builds the fingerprint from two parts: the first 4 bytes are the CRC32 of the host, and the remaining bytes are the MD5 of the rest of the URL. This default is chosen to make use of the HBase block cache: fingerprints for the same host share a prefix, so all the documents of an average website are expected to fit within one cache block, which can be read from disk efficiently in a single pass.

Parameters: key (str) – URL
Returns: str, a 20-byte fingerprint encoded as a hex string
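The construction described above can be sketched as follows. This is an approximation written for illustration, not the library's code: host_prefixed_fingerprint is a hypothetical name, and the exact way the rest of the URL is serialized before hashing is an assumption.

```python
import hashlib
import struct
import zlib
from binascii import hexlify
from urllib.parse import urlparse


def host_prefixed_fingerprint(url):
    """4 bytes of CRC32(host) followed by 16 bytes of MD5(rest of URL), hex-encoded."""
    parsed = urlparse(url)
    host = (parsed.hostname or '').encode('utf-8')
    # Assumption: path and query are what counts as "the rest of the URL".
    rest = (parsed.path + '?' + parsed.query).encode('utf-8')
    host_crc = zlib.crc32(host) & 0xffffffff
    # Big-endian unsigned int (4 bytes) plus the 16-byte MD5 digest = 20 bytes.
    raw = struct.pack('>I16s', host_crc, hashlib.md5(rest).digest())
    return hexlify(raw).decode('ascii')


a = host_prefixed_fingerprint('http://example.com/a')
b = host_prefixed_fingerprint('http://example.com/b')
print(len(a))          # 40
print(a[:8] == b[:8])  # True
```

Because both URLs share a host, their fingerprints start with the same 8 hex characters, which is what keeps rows for one site adjacent in a store sorted by key.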

DomainFingerprintMiddleware

class frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware

When activated, this middleware adds a fingerprint field to the domain field of every Request.meta and Response.meta.

The fingerprint is calculated from the object's URL, using the function defined in the DOMAIN_FINGERPRINT_FUNCTION setting. You can write your own fingerprint calculation function and use it by changing this setting. The fingerprint must be bytes.

An example for a Request object:

>>> request.url
'http://www.scrapinghub.com:8080'

>>> request.meta['domain']
{
    "fingerprint": "5bab61eb53176449e25c2c82f172b82cb13ffb9d",
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}
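A rough sketch of the effect on the domain field is shown below. It assumes the fingerprint function is applied to the domain name, which the text above does not spell out, and it uses a plain SHA-1 helper rather than whatever DOMAIN_FINGERPRINT_FUNCTION points at, so the value it produces is not meant to match the example. Both add_domain_fingerprint and sha1_hex are hypothetical names.

```python
import hashlib


def sha1_hex(value):
    """Hex-encoded SHA-1 of a text value (illustrative fingerprint function)."""
    return hashlib.sha1(value.encode('utf-8')).hexdigest()


def add_domain_fingerprint(meta, fingerprint_function):
    """Store fingerprint_function(domain name) under meta['domain']['fingerprint']."""
    domain = meta['domain']
    # Assumption: the domain name is the input to the fingerprint function.
    domain['fingerprint'] = fingerprint_function(domain['name'])
    return meta


meta = {'domain': {'name': 'scrapinghub.com', 'netloc': 'www.scrapinghub.com'}}
add_domain_fingerprint(meta, sha1_hex)
print(sorted(meta['domain']))  # ['fingerprint', 'name', 'netloc']
```

Since the sketch reads meta['domain'], it only works once a domain field has already been added, which is why the order of middlewares matters here.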