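Frontera API
This section documents the Frontera core API. It is intended for developers of middlewares and backends.
Frontera API / Manager
The main entry point to the Frontera API is the FrontierManager object, which is passed to middlewares and backends through the from_manager class method. This object provides access to all Frontera core components, and it is the only way for middlewares and backends to access them and hook their functionality into Frontera.
The FrontierManager is responsible for loading the installed middlewares and backends, and for managing the data flow around the whole frontier.
Loading from settings
Although FrontierManager can be initialized by passing parameters, the most common way to initialize it is with Frontera Settings.
This is done with the from_settings class method, passing either a string path: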
>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings('my_project.frontier.settings')
or a BaseSettings object:
>>> from frontera import FrontierManager, Settings
>>> settings = Settings()
>>> settings.MAX_REQUESTS = 0
>>> frontier = FrontierManager.from_settings(settings)
It can also be initialized without arguments, in which case the frontier uses the default settings:
>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings()
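The string path in the first example points at an ordinary Python module. The following is a minimal sketch of what such a module might contain. It uses only setting names mentioned on this page, and the values are illustrative assumptions rather than Frontera defaults:

```python
# my_project/frontier/settings.py
# Hypothetical module matching the string path used in the first example.
# Only setting names documented on this page appear here; values are illustrative.

MAX_REQUESTS = 2000      # stop after 2000 returned requests (0 means no limit)
MAX_NEXT_REQUESTS = 10   # cap on requests returned by one get_next_requests() call
AUTO_START = True        # notify components as soon as the manager is created
TEST_MODE = False        # keep frontier test mode switched off
```

Keeping these values in a module rather than passing them as constructor arguments lets several crawler processes share one configuration.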
Frontier Manager
class frontera.core.manager.FrontierManager(request_model, response_model, backend, middlewares=None, test_mode=False, max_requests=0, max_next_requests=0, auto_start=True, settings=None, canonicalsolver=None, db_worker=False, strategy_worker=False)
The FrontierManager object encapsulates the whole frontier and provides an API to interact with it. It is also responsible for loading the different frontier components and for the communication between them.
Parameters:
- request_model (object/string) – The Request object to be used by the frontier.
- response_model (object/string) – The Response object to be used by the frontier.
- backend (object/string) – The Backend object to be used by the frontier.
- middlewares (list) – A list of Middleware objects to be used by the frontier.
- test_mode (bool) – Activate/deactivate frontier test mode.
- max_requests (int) – Number of pages after which the frontier stops (see Finishing the frontier).
- max_next_requests (int) – Maximum number of requests returned by the get_next_requests method.
- auto_start (bool) – Activate/deactivate automatic frontier start (see Starting/stopping the frontier).
- settings (object/string) – The Settings object used by the frontier.
- canonicalsolver (object/string) – The CanonicalSolver object to be used by the frontier.
- db_worker (bool) – True if the class is instantiated in a DB worker environment.
- strategy_worker (bool) – True if the class is instantiated in a strategy worker environment.
Attributes
- request_model – The Request object to be used by the frontier. Can be defined with the REQUEST_MODEL setting.
- response_model – The Response object to be used by the frontier. Can be defined with the RESPONSE_MODEL setting.
- middlewares – A list of Middleware objects to be used by the frontier. Can be defined with the MIDDLEWARES setting.
- test_mode – Boolean value indicating whether the frontier is in test mode. Can be defined with the TEST_MODE setting.
- max_requests – Number of pages after which the frontier stops (see Finishing the frontier). Can be defined with the MAX_REQUESTS setting.
- max_next_requests – Maximum number of requests returned by the get_next_requests method. Can be defined with the MAX_NEXT_REQUESTS setting.
- auto_start – Boolean value indicating whether automatic frontier start is activated (see Starting/stopping the frontier). Can be defined with the AUTO_START setting.
- iteration – Current frontier iteration.
- n_requests – Number of accumulated requests returned by the frontier.
- finished – Boolean value indicating whether the frontier has finished (see Finishing the frontier).
API Methods
- start()
  Notifies all the components that the frontier has started. Typically used for initialization (see Starting/stopping the frontier).
  Returns: None.
- stop()
  Notifies all the components that the frontier has stopped. Typically used for finalization (see Starting/stopping the frontier).
  Returns: None.
- add_seeds(seeds)
  Adds a list of seed requests (seed URLs) as entry points for the crawl.
  Parameters: seeds (list) – A list of Request objects.
  Returns: None.
- get_next_requests(max_next_requests=0, **kwargs)
  Returns a list of the next requests to be crawled. Optionally, a maximum number of pages can be passed. If no value is passed, FrontierManager.max_next_requests (MAX_NEXT_REQUESTS setting) is used instead.
  Parameters:
  - max_next_requests (int) – Maximum number of requests to be returned by this method.
  - kwargs (dict) – Arbitrary arguments that will be passed to the backend.
  Returns: A list of Request objects.
- page_crawled(response)
  Informs the frontier about the crawl result.
  Parameters: response (object) – The Response object for the crawled page.
  Returns: None.
- request_error(request, error)
  Informs the frontier about a page crawl error. An error identifier must be provided.
  Parameters:
  - request (object) – The Request object that was crawled with an error.
  - error (string) – A string identifier for the error.
  Returns: None.
Class Methods
- classmethod from_settings(settings=None, db_worker=False, strategy_worker=False)
  Returns a FrontierManager instance initialized with the passed settings argument. If no settings are given, the frontier default settings are used.
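Starting/stopping the frontier
Sometimes frontier components need to perform initialization and finalization operations. The frontier notifies the different components that it has started or stopped through the start() and stop() methods.
By default auto_start is enabled, which means that components are notified as soon as the FrontierManager object is created. If you need finer control over when the different components are initialized, disable auto_start and call the frontier API start() and stop() methods manually.
Note that the frontier stop() method is not called automatically when auto_start is enabled, because the frontier is not aware of the crawling state. If you need to notify the components that the frontier has stopped, call the method manually.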
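The manual start/stop pattern can be illustrated without Frontera installed. In the sketch below, SketchManager and RecordingComponent are hypothetical stand-ins written for this example, not Frontera classes. They only mimic the notification behaviour described above:

```python
class RecordingComponent:
    """Hypothetical component that records the notifications it receives."""

    def __init__(self):
        self.events = []

    def frontier_start(self):
        self.events.append("start")

    def frontier_stop(self):
        self.events.append("stop")


class SketchManager:
    """Stand-in for FrontierManager, reduced to start/stop notification."""

    def __init__(self, components, auto_start=True):
        self.components = components
        if auto_start:
            # With auto_start enabled, components are notified on creation.
            self.start()

    def start(self):
        for component in self.components:
            component.frontier_start()

    def stop(self):
        for component in self.components:
            component.frontier_stop()


component = RecordingComponent()
manager = SketchManager([component], auto_start=False)
# Nothing has been notified yet, so extra setup could happen here.
manager.start()
try:
    pass  # the crawl would run here
finally:
    # stop() is always an explicit call; it is never triggered automatically.
    manager.stop()
```

Wrapping the crawl in try/finally is one way to make sure components are told about the stop even if the crawl raises.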
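Frontier iterations
Once the frontier is running, the usual process is the one described in the data flow section.
The crawler asks the frontier for the next pages to crawl by calling the get_next_requests() method. Each time the frontier returns a non-empty list of pages (data available), that is what we call a frontier iteration. The current frontier iteration can be accessed through the iteration attribute.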
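To make the counting rule concrete, here is a small sketch in which only non-empty batches advance the iteration counter. SketchFrontier is a hypothetical stand-in backed by an in-memory queue, and the URLs are placeholders. It is not how Frontera stores requests:

```python
from collections import deque


class SketchFrontier:
    """Stand-in for FrontierManager that only tracks the iteration counter."""

    def __init__(self, urls, max_next_requests):
        self._queue = deque(urls)
        self.max_next_requests = max_next_requests
        self.iteration = 0

    def get_next_requests(self):
        batch = []
        while self._queue and len(batch) < self.max_next_requests:
            batch.append(self._queue.popleft())
        if batch:
            # Only a non-empty batch counts as a frontier iteration.
            self.iteration += 1
        return batch


frontier = SketchFrontier(["A1", "A2", "B1"], max_next_requests=2)
first = frontier.get_next_requests()   # two URLs, iteration becomes 1
second = frontier.get_next_requests()  # one URL, iteration becomes 2
third = frontier.get_next_requests()   # empty list, iteration stays at 2
```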
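Finishing the frontier
The crawl can be finished either by the crawler or by Frontera. Frontera finishes when the maximum number of pages has been returned. This limit is controlled by the max_requests attribute (MAX_REQUESTS setting).
If max_requests is set to 0, the frontier keeps crawling indefinitely.
Once the frontier has finished, the get_next_requests method returns no more pages and the finished attribute is True.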
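The interaction between max_requests, max_next_requests and the finished flag can be sketched as a pure function. This is one plausible reading of the rules stated above, written for illustration. The function name and return shape are assumptions, and the real manager may compute its batch size differently:

```python
def next_batch_size(n_requests, max_requests, max_next_requests):
    """Return (batch_size, finished) for one get_next_requests() call.

    n_requests is the number of requests returned so far.
    A batch_size of None means "no cap on this batch".
    """
    if max_requests and n_requests >= max_requests:
        # The limit has been reached: nothing more is returned.
        return 0, True
    if not max_requests:
        # max_requests == 0 means the frontier never finishes on its own.
        return (max_next_requests or None), False
    remaining = max_requests - n_requests
    if not max_next_requests:
        return remaining, False
    # Never hand out more than what is left before the limit.
    return min(max_next_requests, remaining), False


start_of_crawl = next_batch_size(0, 5, 2)     # (2, False)
near_the_limit = next_batch_size(4, 5, 2)     # (1, False)
limit_reached = next_batch_size(5, 5, 2)      # (0, True)
unlimited_crawl = next_batch_size(1000, 0, 2)  # (2, False)
```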
Component objects
class frontera.core.components.Component
Interface definition for a frontier component. The Component object is the base class for frontier Middleware and Backend objects. FrontierManager communicates with the active components using the hook methods listed below.
Implementations are different for Middleware and Backend objects, so the methods are not fully described here but in their corresponding sections.
Attributes
- name – The component name.
Abstract methods
- frontier_start()
  Called when the frontier starts (see Starting/stopping the frontier).
- frontier_stop()
  Called when the frontier stops (see Starting/stopping the frontier).
- add_seeds(seeds)
  Called when new seeds are added to the frontier.
  Parameters: seeds (list) – A list of Request objects.
- page_crawled(response)
  Called every time a page has been crawled.
  Parameters: response (object) – The Response object for the crawled page.
- request_error(page, error)
  Called each time an error occurs when crawling a page.
  Parameters:
  - request (object) – The Request object that was crawled with an error.
  - error (string) – A string identifier for the error.
Class Methods
- classmethod from_manager(manager)
  Class method called by FrontierManager, passing the manager itself.
  Example of usage:
      @classmethod
      def from_manager(cls, manager):
          return cls(settings=manager.settings)
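The hook methods and from_manager fit together roughly as in the sketch below. CountingComponent is a hypothetical, duck-typed class written for this example. A real component would subclass Frontera's Middleware or Backend, whose exact method contracts are described in their own sections. The manager here is a SimpleNamespace stand-in, and the settings dict and string arguments are placeholders:

```python
from types import SimpleNamespace


class CountingComponent:
    """Hypothetical component that counts the events it is notified about."""

    name = "Counting component"

    def __init__(self, settings):
        self.settings = settings
        self.seeds = 0
        self.crawled = 0
        self.errors = {}

    @classmethod
    def from_manager(cls, manager):
        # The manager hands itself over; the component picks what it needs.
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass  # open connections, load state, and so on

    def frontier_stop(self):
        pass  # flush and close whatever frontier_start opened

    def add_seeds(self, seeds):
        self.seeds += len(seeds)

    def page_crawled(self, response):
        self.crawled += 1

    def request_error(self, request, error):
        self.errors[error] = self.errors.get(error, 0) + 1


# Stand-in for a FrontierManager; only the settings attribute is used.
fake_manager = SimpleNamespace(settings={"MAX_REQUESTS": 0})
component = CountingComponent.from_manager(fake_manager)
component.add_seeds(["A1", "B1"])
component.page_crawled("response for A1")
component.request_error("B1", "timeout")
```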
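Test mode
In some cases, while testing, frontier components need to behave differently than usual (for example, the domain middleware accepts non-valid URLs such as 'A1' or 'B1' when parsing domain URLs in test mode).
Components can find out whether the frontier is in test mode through the test_mode attribute.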
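A component might branch on test_mode along the following lines. DomainSketch is a hypothetical class written for this example and is not the actual domain middleware. It only shows the idea of accepting placeholder URLs when test mode is on:

```python
from types import SimpleNamespace
from urllib.parse import urlparse


class DomainSketch:
    """Hypothetical component that extracts the domain from a URL."""

    def __init__(self, test_mode):
        self.test_mode = test_mode

    @classmethod
    def from_manager(cls, manager):
        # Read the flag once from the manager's test_mode attribute.
        return cls(test_mode=manager.test_mode)

    def domain_of(self, url):
        if self.test_mode:
            # In test mode, placeholders such as 'A1' are accepted unchanged.
            return url
        netloc = urlparse(url).netloc
        if not netloc:
            raise ValueError("not a valid URL: %r" % (url,))
        return netloc


testing = DomainSketch.from_manager(SimpleNamespace(test_mode=True))
production = DomainSketch.from_manager(SimpleNamespace(test_mode=False))

test_domain = testing.domain_of("A1")
real_domain = production.domain_of("http://example.com/page")
```

Outside test mode the same placeholder is rejected, so production.domain_of("A1") raises ValueError in this sketch.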
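Other ways of using the frontier
Communication with the frontier can also be done through other mechanisms, such as an HTTP API or a queue system. These features are not available for the time being, but they are expected to be included in future versions.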