Frontera API

This section documents the Frontera core API. It is intended for developers of middlewares and backends.

Frontera API / Manager

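The main entry point to the Frontera API is the FrontierManager object, which is passed to middlewares and backends through the from_manager class method. This object provides access to all Frontera core components, and it is the only way for middlewares and backends to access them and hook their functionality into Frontera.

The FrontierManager is responsible for loading the installed middlewares and backends, and for managing the data flow around the whole frontier.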

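Loading from settings

Although the FrontierManager can be initialized with parameters, the most common way to initialize it is through Frontera Settings.

This can be done with the from_settings class method, using either a string path: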

>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings('my_project.frontier.settings')

or a BaseSettings object:

>>> from frontera import FrontierManager, Settings
>>> settings = Settings()
>>> settings.MAX_PAGES = 0
>>> frontier = FrontierManager.from_settings(settings)

It can also be initialized without parameters, in which case the frontier uses the default settings:

>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings()

Frontier Manager

class frontera.core.manager.FrontierManager(request_model, response_model, backend, middlewares=None, test_mode=False, max_requests=0, max_next_requests=0, auto_start=True, settings=None, canonicalsolver=None, db_worker=False, strategy_worker=False)

The FrontierManager object encapsulates the whole frontier and provides an API to interact with it. It is also responsible for loading the frontier components and handling communication between them.

Parameters:
  • request_model (object/string) – The Request object to be used by the frontier.
  • response_model (object/string) – The Response object to be used by the frontier.
  • backend (object/string) – The Backend object to be used by the frontier.
  • middlewares (list) – A list of Middleware objects to be used by the frontier.
  • test_mode (bool) – Activate/deactivate frontier test mode.
  • max_requests (int) – Number of pages after which the frontier would stop (See Finish conditions).
  • max_next_requests (int) – Maximum number of requests returned by get_next_requests method.
  • auto_start (bool) – Activate/deactivate automatic frontier start (See starting/stopping the frontier).
  • settings (object/string) – The Settings object used by the frontier.
  • canonicalsolver (object/string) – The CanonicalSolver object to be used by frontier.
  • db_worker (bool) – True if class is instantiated in DB worker environment
  • strategy_worker (bool) – True if class is instantiated in strategy worker environment

Attributes

request_model

The Request object to be used by the frontier. Can be defined with REQUEST_MODEL setting.

response_model

The Response object to be used by the frontier. Can be defined with RESPONSE_MODEL setting.

backend

The Backend object to be used by the frontier. Can be defined with BACKEND setting.

middlewares

A list of Middleware objects to be used by the frontier. Can be defined with MIDDLEWARES setting.

test_mode

Boolean value indicating if the frontier is using frontier test mode. Can be defined with TEST_MODE setting.

max_requests

Number of pages after which the frontier would stop (See Finish conditions). Can be defined with MAX_REQUESTS setting.

max_next_requests

Maximum number of requests returned by get_next_requests method. Can be defined with MAX_NEXT_REQUESTS setting.

auto_start

Boolean value indicating if automatic frontier start is activated. See starting/stopping the frontier. Can be defined with AUTO_START setting.

settings

The Settings object used by the frontier.

iteration

Current frontier iteration.

n_requests

Number of accumulated requests returned by the frontier.

finished

Boolean value indicating if the frontier has finished. See Finish conditions.

API Methods

start()

Notifies all the components of the frontier start. Typically used for initializations (See starting/stopping the frontier).

Returns: None.

stop()

Notifies all the components of the frontier stop. Typically used for finalizations (See starting/stopping the frontier).

Returns: None.

add_seeds(seeds)

Adds a list of seed requests (seed URLs) as entry points for the crawl.

Parameters: seeds (list) – A list of Request objects.
Returns: None.

get_next_requests(max_next_requests=0, **kwargs)

Returns a list of the next requests to be crawled. Optionally, a maximum number of requests can be passed. If no value is passed, FrontierManager.max_next_requests (the MAX_NEXT_REQUESTS setting) is used instead.

Parameters:
  • max_next_requests (int) – Maximum number of requests to be returned by this method.
  • kwargs (dict) – Arbitrary arguments that will be passed to the backend.
Returns: A list of Request objects.

page_crawled(response)

Informs the frontier about the crawl result.

Parameters: response (object) – The Response object for the crawled page.
Returns: None.

request_error(request, error)

Informs the frontier about a page crawl error. An error identifier must be provided.

Parameters:
  • request (object) – The Request object whose crawl failed.
  • error (string) – A string identifier for the error.
Returns: None.
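
The call sequence a crawler would typically follow with these methods can be sketched as below. This is only an illustration under stated assumptions: ToyFrontier is a hypothetical stand-in that exposes the same method names as FrontierManager, plain URL strings take the place of Request and Response objects, and fake_fetch is an invented downloader. None of it is Frontera's own code.

```python
from collections import deque


class ToyFrontier:
    """Hypothetical stand-in exposing the method names described above."""

    def __init__(self):
        self._queue = deque()
        self.crawled = []
        self.errors = []

    def add_seeds(self, seeds):
        self._queue.extend(seeds)

    def get_next_requests(self, max_next_requests=0):
        # In this sketch, 0 means "no per-call limit".
        n = max_next_requests or len(self._queue)
        return [self._queue.popleft() for _ in range(min(n, len(self._queue)))]

    def page_crawled(self, response):
        self.crawled.append(response)

    def request_error(self, request, error):
        self.errors.append((request, error))


def fake_fetch(url):
    """Invented downloader: fails for any URL containing 'bad'."""
    if "bad" in url:
        raise IOError("unreachable")
    return "<html>%s</html>" % url


frontier = ToyFrontier()
frontier.add_seeds([
    "http://example.com/a",
    "http://example.com/bad",
    "http://example.com/b",
])

while True:
    batch = frontier.get_next_requests(max_next_requests=2)
    if not batch:
        break  # nothing left to crawl
    for url in batch:
        try:
            body = fake_fetch(url)
        except IOError:
            # Report the failure with a string identifier.
            frontier.request_error(url, "fetch_error")
        else:
            frontier.page_crawled((url, body))
```

The loop reports every outcome back to the frontier, either through page_crawled or through request_error, so the frontier always learns what happened to each request it handed out.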

Class Methods

classmethod from_settings(settings=None, db_worker=False, strategy_worker=False)

Returns a FrontierManager instance initialized with the passed settings argument. If no settings is given, frontier default settings are used.

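Starting/stopping the frontier

Sometimes frontier components need to perform initialization and finalization operations. The frontier notifies its components of starting and stopping through the start() and stop() methods.

By default auto_start is activated, which means that components are notified as soon as the FrontierManager object is created. If you need finer control over when the components are initialized, deactivate auto_start and call the frontier API start() and stop() methods manually.

Note that the frontier stop() method is not called automatically, even when auto_start is active, because the frontier is not aware of the crawling state. If you need to notify the components that the frontier has stopped, you must call the method manually.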
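
A minimal sketch of manual start/stop control follows. It assumes the notification behaviour described above and models only that behaviour: ToyFrontier and RecordingComponent are hypothetical names, not Frontera classes.

```python
class RecordingComponent:
    """Hypothetical component that records the notifications it receives."""

    def __init__(self):
        self.events = []

    def frontier_start(self):
        self.events.append("start")

    def frontier_stop(self):
        self.events.append("stop")


class ToyFrontier:
    """Stand-in for FrontierManager; it models start/stop notification only."""

    def __init__(self, components, auto_start=True):
        self.components = components
        if auto_start:
            self.start()

    def start(self):
        for component in self.components:
            component.frontier_start()

    def stop(self):
        for component in self.components:
            component.frontier_stop()


component = RecordingComponent()
frontier = ToyFrontier([component], auto_start=False)
events_before_start = list(component.events)  # nothing notified yet

frontier.start()
try:
    pass  # the crawl loop would run here
finally:
    # stop() is never called implicitly, so the caller is responsible for it.
    frontier.stop()
```

Wrapping the crawl in try/finally is one way to make sure components are notified of the stop even when the crawl raises.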

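Frontier iterations

Once the frontier is running, the usual process is the one described in the data flow section.

The crawler asks the frontier for the next pages to crawl by calling the get_next_requests() method. Each call for which the frontier returns a non-empty list of pages (data available) counts as one frontier iteration. The current frontier iteration can be accessed through the iteration attribute.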
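
The counting rule above can be illustrated with a toy model. It is a sketch that assumes the iteration counter advances only when a call returns at least one request. ToyFrontier is a hypothetical stand-in, and short strings take the place of Request objects.

```python
from collections import deque


class ToyFrontier:
    """Hypothetical stand-in that models only the iteration counter."""

    def __init__(self, pending):
        self._queue = deque(pending)
        self.iteration = 0

    def get_next_requests(self, max_next_requests):
        size = min(max_next_requests, len(self._queue))
        batch = [self._queue.popleft() for _ in range(size)]
        if batch:
            # Only non-empty batches count as an iteration.
            self.iteration += 1
        return batch


frontier = ToyFrontier(["a", "b", "c"])
first = frontier.get_next_requests(2)   # two requests, iteration becomes 1
second = frontier.get_next_requests(2)  # one request left, iteration becomes 2
third = frontier.get_next_requests(2)   # empty list, iteration unchanged
```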

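Finishing the frontier

The crawl can be stopped either by the crawler or by Frontera. Frontera finishes once it has returned the maximum number of pages. This limit is controlled by the max_requests attribute (MAX_REQUESTS setting).

If max_requests is set to 0, the frontier keeps crawling indefinitely.

Once the frontier has finished, the get_next_requests method no longer returns any pages, and the finished attribute is True.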
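
The finish condition can be sketched with a toy model as well. It assumes the limit is compared against the accumulated n_requests counter and that the last batch is trimmed so the limit is never exceeded; both points are assumptions of this sketch. ToyFrontier is a hypothetical stand-in that generates page names instead of reading from a backend.

```python
from itertools import count


class ToyFrontier:
    """Hypothetical stand-in that models only the max_requests limit."""

    def __init__(self, max_requests=0):
        self.max_requests = max_requests
        self.n_requests = 0
        self._ids = count(1)

    @property
    def finished(self):
        # A limit of 0 disables the check, so the frontier never finishes.
        return bool(self.max_requests) and self.n_requests >= self.max_requests

    def get_next_requests(self, max_next_requests):
        if self.finished:
            return []
        size = max_next_requests
        if self.max_requests:
            # Trim the batch so the total never exceeds the limit.
            size = min(size, self.max_requests - self.n_requests)
        batch = ["page-%d" % next(self._ids) for _ in range(size)]
        self.n_requests += len(batch)
        return batch


limited = ToyFrontier(max_requests=5)
batch_sizes = [len(limited.get_next_requests(3)) for _ in range(3)]

unlimited = ToyFrontier(max_requests=0)
unlimited.get_next_requests(100)
```

In the limited case the third call returns an empty list because the five-request budget is already spent, while the unlimited frontier is still not finished after 100 requests.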

Component objects

class frontera.core.components.Component

Interface definition for a frontier component. The Component object is the base class for frontier Middleware and Backend objects.

FrontierManager communicates with the active components using the hook methods listed below.

Implementations are different for Middleware and Backend objects, therefore methods are not fully described here but in their corresponding section.

Attributes

name

The component name

Abstract methods

frontier_start()

Called when the frontier starts, see starting/stopping the frontier.

frontier_stop()

Called when the frontier stops, see starting/stopping the frontier.

add_seeds(seeds)

This method is called when new seeds are added to the frontier.

Parameters: seeds (list) – A list of Request objects.

page_crawled(response)

This method is called every time a page has been crawled.

Parameters: response (object) – The Response object for the crawled page.

request_error(page, error)

This method is called each time an error occurs when crawling a page.

Parameters:
  • page (object) – The Request object whose crawl failed.
  • error (string) – A string identifier for the error.
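
As an illustration of the hook methods listed above, the sketch below defines an abstract base with the same method names and a concrete component that only counts notifications. ToyComponent and CrawlStats are hypothetical; a real component would subclass Frontera's Middleware or Backend, whose hooks have their own contracts described in the corresponding sections.

```python
from abc import ABC, abstractmethod


class ToyComponent(ABC):
    """Stand-in mirroring the hook names listed above (not Frontera's class)."""

    name = ""

    @abstractmethod
    def frontier_start(self): ...

    @abstractmethod
    def frontier_stop(self): ...

    @abstractmethod
    def add_seeds(self, seeds): ...

    @abstractmethod
    def page_crawled(self, response): ...

    @abstractmethod
    def request_error(self, page, error): ...


class CrawlStats(ToyComponent):
    """Hypothetical component that only counts the notifications it receives."""

    name = "Crawl Stats"

    def __init__(self):
        self.running = False
        self.seeds = 0
        self.pages = 0
        self.errors = {}

    def frontier_start(self):
        self.running = True

    def frontier_stop(self):
        self.running = False

    def add_seeds(self, seeds):
        self.seeds += len(seeds)

    def page_crawled(self, response):
        self.pages += 1

    def request_error(self, page, error):
        # Group failures by their string identifier.
        self.errors[error] = self.errors.get(error, 0) + 1


stats = CrawlStats()
stats.frontier_start()
stats.add_seeds(["http://example.com/a", "http://example.com/b"])
stats.page_crawled("response for /a")
stats.request_error("http://example.com/b", "timeout")
stats.frontier_stop()
```

Declaring the hooks as abstract methods means a subclass that forgets one of them cannot be instantiated, which surfaces the omission early.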

Class Methods

classmethod from_manager(manager)

Class method called from FrontierManager passing the manager itself.

Example of usage:

@classmethod
def from_manager(cls, manager):
    # Build the component from the settings held by the manager.
    return cls(settings=manager.settings)
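
To see the pattern in a complete class, here is a small self-contained sketch. UrlLoggerMiddleware is a hypothetical component, and a SimpleNamespace with a plain dict stands in for the FrontierManager and its Settings object, so that the example runs without Frontera installed.

```python
from types import SimpleNamespace


class UrlLoggerMiddleware:
    """Hypothetical component; not part of Frontera."""

    name = "URL Logger Middleware"

    def __init__(self, settings):
        self.settings = settings

    @classmethod
    def from_manager(cls, manager):
        # The manager hands itself over; the component picks what it needs.
        return cls(settings=manager.settings)


# Stand-in for a FrontierManager: only the attribute used above is provided.
fake_manager = SimpleNamespace(settings={"MAX_REQUESTS": 0})
component = UrlLoggerMiddleware.from_manager(fake_manager)
```

Keeping the constructor free of any manager dependency makes the component easy to build directly in unit tests, while from_manager remains the single place that knows how to read the manager.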

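Test mode

In some cases, while testing, frontier components need to behave differently than they usually do (for instance, the domain middleware accepts non-valid URLs such as 'A1' or 'B1' when parsing domain URLs in test mode).

Components can find out whether the frontier is in test mode through the test_mode attribute.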
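
A component might branch on test_mode roughly as follows. This is a sketch, not the domain middleware's real logic: ToyDomainComponent is hypothetical, the manager is replaced by a SimpleNamespace, and treating the raw identifier as its own domain in test mode is an assumption made for illustration.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


class ToyDomainComponent:
    """Hypothetical component that relaxes URL validation in test mode."""

    def __init__(self, manager):
        self.manager = manager

    @classmethod
    def from_manager(cls, manager):
        return cls(manager)

    def domain_for(self, url):
        if self.manager.test_mode:
            # Assumption: identifiers such as 'A1' are accepted as-is.
            return url
        host = urlparse(url).hostname
        if host is None:
            raise ValueError("not a valid URL: %r" % url)
        return host


normal = ToyDomainComponent.from_manager(SimpleNamespace(test_mode=False))
testing = ToyDomainComponent.from_manager(SimpleNamespace(test_mode=True))

real_domain = normal.domain_for("http://example.com/page")
test_domain = testing.domain_for("A1")
```

Outside test mode the same 'A1' input is rejected with a ValueError, which is the behavioural difference the attribute is meant to switch.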

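Other ways of using the frontier

Communication with the frontier can also be done through other mechanisms, such as an HTTP API or a queue system. These functionalities are not available for the time being, but the intention is to include them in future versions.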