Frontier objects

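The frontier uses two object types: Request and Response. They represent HTTP requests and HTTP responses, respectively.

These classes are used by most Frontera API methods, either as a parameter or as a return value, depending on the method.

Frontera also uses these objects internally to pass data between components (for example, middlewares and the backend).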

Request objects

class frontera.core.models.Request(url, method='GET', headers=None, cookies=None, meta=None, body='')

A Request object represents an HTTP request. Requests are generated for seeds, for links extracted from pages, and for the next pages to crawl. Once crawled, each request should be associated with a Response object.

Parameters:
  • url (string) – URL to send.
  • method (string) – HTTP method to use.
  • headers (dict) – dictionary of headers to send.
  • cookies (dict) – dictionary of cookies to attach to this request.
  • meta (dict) – dictionary that contains arbitrary metadata for this request. Keys must be bytes, and values must be either bytes or serializable objects such as lists, tuples, or dictionaries whose items are bytes.

body

A string representing the request body.

cookies

Dictionary of cookies to attach to this request.

headers

A dictionary which contains the request headers.

meta

A dict that contains arbitrary metadata for this request. This dict is empty for new Requests and is usually populated by Frontera components (middlewares, etc.), so its contents depend on the components you have enabled. Keys are bytes, and values are either bytes or serializable objects such as lists, tuples, or dictionaries whose items are bytes.

method

A string representing the HTTP method of the request. This is guaranteed to be uppercase. Examples: GET, POST, PUT.

url

A string containing the URL of this request.
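
The constraint on meta described above (bytes keys, values that are bytes or containers of bytes) can be checked mechanically. The following is a minimal sketch under a strict reading of that rule. The helper names is_valid_meta and is_valid_meta_value are hypothetical and are not part of Frontera, and Frontera itself may accept other serializable values that this check rejects.

```python
def is_valid_meta_value(value):
    """Strict reading of the documented rule: bytes, or list/tuple/dict built from bytes."""
    if isinstance(value, bytes):
        return True
    if isinstance(value, (list, tuple)):
        return all(is_valid_meta_value(item) for item in value)
    if isinstance(value, dict):
        return all(
            isinstance(key, bytes) and is_valid_meta_value(item)
            for key, item in value.items()
        )
    return False


def is_valid_meta(meta):
    """Hypothetical helper (not a Frontera function): validate a whole meta dict."""
    return isinstance(meta, dict) and all(
        isinstance(key, bytes) and is_valid_meta_value(value)
        for key, value in meta.items()
    )


good_meta = {b"depth": b"1", b"tags": [b"seed", b"news"]}
bad_meta = {"depth": b"1"}  # str key instead of bytes
```

A check like this is mainly useful in tests for custom middlewares, where a stray str key is easy to introduce.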

Response objects

class frontera.core.models.Response(url, status_code=200, headers=None, body='', request=None)

A Response object represents an HTTP response, which is usually downloaded (by the crawler) and sent back to the frontier for processing.

Parameters:
  • url (string) – URL of this response.
  • status_code (int) – the HTTP status of the response. Defaults to 200.
  • headers (dict) – dictionary of headers received with the response.
  • body (str) – the response body.
  • request (Request) – The Request object that generated this response.

body

A str containing the body of this Response.

headers

A dictionary object which contains the response headers.

meta

A shortcut to the Request.meta attribute of the Response.request object (i.e. self.request.meta).

request

The Request object that generated this response.

status_code

An integer representing the HTTP status of the response. Example: 200, 404, 500.

url

A string containing the URL of the response.

The domain and fingerprint fields are added by built-in middlewares.
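
The relationship between the two classes, in particular Response.meta being a shortcut to the meta of the originating request, can be illustrated with stand-in classes. This is a simplified sketch that only mirrors the constructor signatures documented above. It is not Frontera's source code, and details such as the error raised for a response without a request are assumptions.

```python
class Request:
    """Stand-in mirroring the documented constructor; not Frontera's implementation."""

    def __init__(self, url, method="GET", headers=None, cookies=None, meta=None, body=""):
        self.url = url
        self.method = str(method).upper()  # documented as always uppercase
        self.headers = headers if headers is not None else {}
        self.cookies = cookies if cookies is not None else {}
        self.meta = meta if meta is not None else {}  # empty for new requests
        self.body = body


class Response:
    """Stand-in mirroring the documented constructor; not Frontera's implementation."""

    def __init__(self, url, status_code=200, headers=None, body="", request=None):
        self.url = url
        self.status_code = int(status_code)
        self.headers = headers if headers is not None else {}
        self.body = body
        self.request = request

    @property
    def meta(self):
        # Shortcut to self.request.meta, as described for the meta attribute.
        if self.request is None:
            raise AttributeError("this response is not tied to any request")
        return self.request.meta


request = Request("http://example.com/", method="get")
response = Response(request.url, status_code=200, body="<html></html>", request=request)

# Writing through the response is visible on the request: both share one dict.
response.meta[b"depth"] = b"1"
```

Because the response exposes the same dict rather than a copy, data attached to a request by one component remains available when the matching response is processed.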

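Identifying unique objects

Because Frontera objects are passed between the crawler and the frontier, some mechanism is needed to identify each object uniquely. This mechanism may vary depending on the frontier logic (in most cases, on the backend in use).

By default, Frontera activates the fingerprint middleware, which generates a unique fingerprint from Request.url and Response.url and stores it in Request.meta and Response.meta respectively. You can use this middleware or implement your own.

An example of a fingerprint generated for a Request object: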

>>> request.url
'http://thehackernews.com'

>>> request.meta['fingerprint']
'198d99a8b2284701d6c147174cd69a37a7dea90f'
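
The 40-character hexadecimal value above has the shape of a SHA-1 digest. As a rough sketch of how a URL fingerprint of that shape could be computed, assuming SHA-1 over the UTF-8 encoded URL with no prior normalization, consider the following. The function url_fingerprint is a hypothetical name, and the actual middleware may canonicalize the URL or use a different hash, so the digest is not guaranteed to match the value shown above.

```python
import hashlib


def url_fingerprint(url):
    """Hypothetical helper: hex SHA-1 digest of the UTF-8 encoded URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


first = url_fingerprint("http://thehackernews.com")
second = url_fingerprint("http://thehackernews.com")
other = url_fingerprint("http://thehackernews.com/about")
```

The two properties that matter for identification are that the same URL always yields the same fingerprint and that different URLs yield different ones in practice.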

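Adding additional data to objects

In most cases, Frontera objects already carry the information the frontier needs to operate.

Components can also store additional data in Request.meta and Response.meta.

For example, when the domain middleware is activated, it adds a domain field to every Request.meta and Response.meta: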

>>> request.url
'http://www.scrapinghub.com'

>>> request.meta['domain']
{
    "name": "scrapinghub.com",
    "netloc": "www.scrapinghub.com",
    "scheme": "http",
    "sld": "scrapinghub",
    "subdomain": "www",
    "tld": "com"
}
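
To show where the individual fields of such a domain entry come from, here is a naive sketch that derives the same keys from a URL using only the standard library. The function naive_domain_info is a hypothetical name and is not the middleware's implementation. It assumes the registered domain is always the last two labels of the host, which is wrong for multi-part suffixes such as co.uk.

```python
from urllib.parse import urlparse


def naive_domain_info(url):
    """Hypothetical helper: split a URL into the fields shown in the example above."""
    parts = urlparse(url)
    host = parts.hostname or ""
    labels = host.split(".")
    if len(labels) >= 2:
        # Assumption: the last label is the TLD and the one before it is the SLD.
        sld, tld = labels[-2], labels[-1]
        subdomain = ".".join(labels[:-2])
        name = sld + "." + tld
    else:
        sld, tld, subdomain, name = host, "", "", host
    return {
        "name": name,
        "netloc": parts.netloc,
        "scheme": parts.scheme,
        "sld": sld,
        "subdomain": subdomain,
        "tld": tld,
    }


info = naive_domain_info("http://www.scrapinghub.com")
```

For the URL in the example, this produces the same six values as the dictionary shown above.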