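Frontera API
This section documents the Frontera core API. It is intended for developers of middlewares and backends.
Frontera API / Manager
The main entry point to the Frontera API is the FrontierManager object, which is passed to middlewares and backends through the from_manager class method. This object provides access to all Frontera core components, and it is the only way for middlewares and backends to access them and hook their functionality into Frontera.
The FrontierManager is responsible for loading the installed middlewares and backends, and for managing the data flow around the whole frontier.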
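Loading from settings
Although FrontierManager can be initialized with parameters, the most common way of initializing it is through Frontera Settings.
This is done with the from_settings class method, using either a string path: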
>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings('my_project.frontier.settings')
or a BaseSettings object:
>>> from frontera import FrontierManager, Settings
>>> settings = Settings()
>>> settings.MAX_PAGES = 0
>>> frontier = FrontierManager.from_settings(settings)
It can also be initialized without arguments, in which case the frontier uses the default settings:
>>> from frontera import FrontierManager
>>> frontier = FrontierManager.from_settings()
Frontier Manager

class frontera.core.manager.FrontierManager(request_model, response_model, backend, middlewares=None, test_mode=False, max_requests=0, max_next_requests=0, auto_start=True, settings=None, canonicalsolver=None, db_worker=False, strategy_worker=False)

The FrontierManager object encapsulates the whole frontier, providing an API to interact with it. It is also responsible for loading all the different frontier components and for the communication between them.

Parameters:
- request_model (object/string) – The Request object to be used by the frontier.
- response_model (object/string) – The Response object to be used by the frontier.
- backend (object/string) – The Backend object to be used by the frontier.
- middlewares (list) – A list of Middleware objects to be used by the frontier.
- test_mode (bool) – Activate/deactivate frontier test mode.
- max_requests (int) – Number of pages after which the frontier stops (see Finish conditions).
- max_next_requests (int) – Maximum number of requests returned by the get_next_requests method.
- auto_start (bool) – Activate/deactivate automatic frontier start (see starting/stopping the frontier).
- settings (object/string) – The Settings object used by the frontier.
- canonicalsolver (object/string) – The CanonicalSolver object to be used by the frontier.
- db_worker (bool) – True if the class is instantiated in a DB worker environment.
- strategy_worker (bool) – True if the class is instantiated in a strategy worker environment.
Attributes

- request_model
  The Request object to be used by the frontier. Can be defined with the REQUEST_MODEL setting.
- response_model
  The Response object to be used by the frontier. Can be defined with the RESPONSE_MODEL setting.
- middlewares
  A list of Middleware objects to be used by the frontier. Can be defined with the MIDDLEWARES setting.
- test_mode
  Boolean value indicating whether the frontier is using frontier test mode. Can be defined with the TEST_MODE setting.
- max_requests
  Number of pages after which the frontier stops (see Finish conditions). Can be defined with the MAX_REQUESTS setting.
- max_next_requests
  Maximum number of requests returned by the get_next_requests method. Can be defined with the MAX_NEXT_REQUESTS setting.
- auto_start
  Boolean value indicating whether automatic frontier start is activated (see starting/stopping the frontier). Can be defined with the AUTO_START setting.
- iteration
  Current frontier iteration.
- n_requests
  Number of accumulated requests returned by the frontier.
- finished
  Boolean value indicating whether the frontier has finished (see Finish conditions).
API Methods

- start()
  Notifies all the components that the frontier is starting. Typically used for initializations (see starting/stopping the frontier).
  Returns: None.
- stop()
  Notifies all the components that the frontier is stopping. Typically used for finalizations (see starting/stopping the frontier).
  Returns: None.
- add_seeds(seeds)
  Adds a list of seed requests (seed URLs) as the entry point for the crawl.
  Parameters:
  - seeds (list) – A list of Request objects.
  Returns: None.
- get_next_requests(max_next_requests=0, **kwargs)
  Returns a list of the next requests to be crawled. Optionally, a maximum number of pages can be passed. If no value is passed, FrontierManager.max_next_requests is used instead (MAX_NEXT_REQUESTS setting).
  Parameters:
  - max_next_requests (int) – Maximum number of requests to be returned by this method.
  - kwargs (dict) – Arbitrary arguments that are passed to the backend.
  Returns: list of Request objects.
- page_crawled(response)
  Informs the frontier about the crawl result.
  Parameters:
  - response (object) – The Response object for the crawled page.
  Returns: None.
- request_error(request, error)
  Informs the frontier about a page crawl error. An error identifier must be provided.
  Parameters:
  - request (object) – The Request object for the page that was crawled with an error.
  - error (string) – A string identifier for the error.
  Returns: None.
Class Methods

- classmethod from_settings(settings=None, db_worker=False, strategy_worker=False)
  Returns a FrontierManager instance initialized with the passed settings argument. If no settings are given, the frontier default settings are used.
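Starting/stopping the frontier
Sometimes, frontier components need to perform initialization and finalization operations. The frontier notifies the different components that it is starting or stopping through the start() and stop() methods.
By default auto_start is activated, which means that components are notified as soon as the FrontierManager object is created. If you need finer control over when the different components are initialized, deactivate auto_start and call the frontier API start() and stop() methods manually.
Note that the frontier stop() method is not called automatically, even when auto_start is active (because the frontier is not aware of the crawling state). If you need to notify the frontier components that the frontier is stopping, you must call the method manually.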
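The start/stop contract described above can be sketched without importing Frontera at all. In the snippet below, SketchManager and LoggingComponent are hypothetical stand-ins written for illustration only; they are not Frontera classes and only model who notifies whom, assuming components expose frontier_start() and frontier_stop() hooks.

```python
class LoggingComponent:
    """Hypothetical component that records lifecycle notifications."""

    def __init__(self):
        self.events = []

    def frontier_start(self):
        self.events.append("start")

    def frontier_stop(self):
        self.events.append("stop")


class SketchManager:
    """Stand-in for FrontierManager (assumption, not the real class)."""

    def __init__(self, components, auto_start=True):
        self.components = components
        if auto_start:
            # With auto_start enabled, components are notified on creation.
            self.start()

    def start(self):
        for component in self.components:
            component.frontier_start()

    def stop(self):
        for component in self.components:
            component.frontier_stop()


component = LoggingComponent()
manager = SketchManager([component], auto_start=False)
events_before_start = list(component.events)  # nothing notified yet

manager.start()
try:
    pass  # the crawl itself would run here
finally:
    # stop() is never called implicitly, so the caller does it.
    manager.stop()
```

Wrapping the crawl in try/finally is one way to make sure components always receive the stop notification, including when the crawl raises.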
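Frontier iterations
Once the frontier is running, the usual process is the one described in the data flow section.
The crawler asks the frontier for the next pages to crawl by calling the get_next_requests() method. Each time the frontier returns a non-empty list of pages (data available), we call it a frontier iteration. The current frontier iteration can be accessed through the iteration attribute.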
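As a rough illustration of how the iteration counter behaves, the following self-contained sketch uses a hypothetical IterationSketch class (not part of Frontera) that is fed pre-made batches. It assumes only what the text states: the counter advances when a call returns a non-empty list, and an empty result does not count.

```python
class IterationSketch:
    """Hypothetical stand-in that models only the iteration counter."""

    def __init__(self, batches):
        self._batches = list(batches)
        self.iteration = 0

    def get_next_requests(self):
        batch = self._batches.pop(0) if self._batches else []
        if batch:
            # Only a non-empty result counts as a frontier iteration.
            self.iteration += 1
        return batch


frontier = IterationSketch([
    ["http://example.com/a"],
    [],  # no data available on this call
    ["http://example.com/b", "http://example.com/c"],
])

seen = []
for _ in range(3):
    seen.extend(frontier.get_next_requests())
```

After the three calls, three URLs have been returned but only two iterations have been counted, because the middle call produced no data.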
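Finishing the frontier
The crawl can be stopped either by the crawler application or by Frontera itself. Frontera finishes once the maximum number of pages has been returned. This limit is controlled by the max_requests attribute (MAX_REQUESTS setting).
If max_requests is set to 0, the frontier continues indefinitely.
Once the frontier has finished, the get_next_requests method no longer returns any pages, and the finished attribute is True.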
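The finish condition can be sketched in a few lines. FinishSketch below is a hypothetical, simplified model and not Frontera's implementation; in particular, capping the final batch so that the total never exceeds max_requests is an assumption made for this example.

```python
class FinishSketch:
    """Hypothetical stand-in that models only the max_requests limit."""

    def __init__(self, queue, max_requests=0):
        self._queue = list(queue)
        self.max_requests = max_requests
        self.n_requests = 0
        self.finished = False

    def get_next_requests(self, max_next_requests=0):
        # A max_requests of 0 disables the limit entirely.
        if self.max_requests and self.n_requests >= self.max_requests:
            self.finished = True
            return []
        limit = max_next_requests or len(self._queue)
        if self.max_requests:
            # Assumption: never hand out more than the remaining budget.
            limit = min(limit, self.max_requests - self.n_requests)
        batch, self._queue = self._queue[:limit], self._queue[limit:]
        self.n_requests += len(batch)
        return batch


urls = ["http://example.com/%d" % i for i in range(10)]
frontier = FinishSketch(urls, max_requests=3)

first = frontier.get_next_requests(max_next_requests=2)   # two requests
second = frontier.get_next_requests(max_next_requests=2)  # capped to one
third = frontier.get_next_requests(max_next_requests=2)   # limit reached
```

A crawler loop can therefore use the finished attribute, rather than an empty batch alone, to decide when to stop polling.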
Component objects

class frontera.core.components.Component

Interface definition for a frontier component.

The Component object is the base class for frontier Middleware and Backend objects. FrontierManager communicates with the active components using the hook methods listed below.

Implementations are different for Middleware and Backend objects, so the methods are not fully described here but in their corresponding sections.

Attributes

- name
  The component name.

Abstract methods

- frontier_start()
  Called when the frontier starts (see starting/stopping the frontier).
- frontier_stop()
  Called when the frontier stops (see starting/stopping the frontier).
- add_seeds(seeds)
  This method is called when new seeds are added to the frontier.
  Parameters:
  - seeds (list) – A list of Request objects.
- page_crawled(response)
  This method is called every time a page has been crawled.
  Parameters:
  - response (object) – The Response object for the crawled page.
- request_error(page, error)
  This method is called each time an error occurs when crawling a page.
  Parameters:
  - request (object) – The Request object for the page that was crawled with an error.
  - error (string) – A string identifier for the error.

Class Methods

- classmethod from_manager(manager)
  Class method called from FrontierManager, passing the manager itself.
  Example of usage:
      def from_manager(cls, manager):
          return cls(settings=manager.settings)
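To show how the hooks and from_manager fit together, here is a minimal sketch of a component-shaped class. It deliberately does not subclass the real Component so that it runs without Frontera installed; CountingComponent, its counters, and the SimpleNamespace used in place of a manager are all hypothetical.

```python
from types import SimpleNamespace


class CountingComponent:
    """Hypothetical component that counts the events it is told about."""

    name = "Counting Component (illustrative)"

    def __init__(self, settings):
        self.settings = settings
        self.seeds = 0
        self.crawled = 0
        self.errors = []

    @classmethod
    def from_manager(cls, manager):
        # The manager is the component's only handle on the frontier.
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass  # open connections, load state, and so on

    def frontier_stop(self):
        pass  # flush and release resources

    def add_seeds(self, seeds):
        self.seeds += len(seeds)

    def page_crawled(self, response):
        self.crawled += 1

    def request_error(self, page, error):
        self.errors.append(error)


# Stand-in for a FrontierManager; only the attribute used above is provided.
fake_manager = SimpleNamespace(settings={"MAX_REQUESTS": 0})

component = CountingComponent.from_manager(fake_manager)
component.add_seeds(["http://example.com/"])
component.page_crawled(response=object())
component.request_error(page=object(), error="timeout")
```

Building the component through from_manager rather than its constructor keeps the settings lookup in one place, which mirrors the usage example given in the reference above.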
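Test mode
In some cases, while testing, frontier components need to behave differently from usual (for example, the domain middleware accepts invalid URLs such as 'A1' or 'B1' when parsing domain URLs in test mode).
Components can find out whether the frontier is in test mode through the test_mode attribute.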
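A component might branch on test_mode along the following lines. DomainSketch is a hypothetical class written for this example and is not Frontera's domain middleware; it assumes the manager exposes a test_mode attribute, and it uses the standard library's urlparse for the strict path.

```python
from types import SimpleNamespace
from urllib.parse import urlparse


class DomainSketch:
    """Hypothetical component that relaxes URL checks in test mode."""

    def __init__(self, test_mode=False):
        self.test_mode = test_mode

    @classmethod
    def from_manager(cls, manager):
        return cls(test_mode=manager.test_mode)

    def domain_for(self, url):
        if self.test_mode:
            # Placeholder identifiers such as 'A1' are accepted as-is.
            return url
        netloc = urlparse(url).netloc
        if not netloc:
            raise ValueError("not a valid URL: %r" % url)
        return netloc


# SimpleNamespace objects stand in for managers in each mode.
test_component = DomainSketch.from_manager(SimpleNamespace(test_mode=True))
live_component = DomainSketch.from_manager(SimpleNamespace(test_mode=False))

test_domain = test_component.domain_for("A1")
live_domain = live_component.domain_for("http://example.com/page")
```

Outside test mode the same 'A1' input raises ValueError, so placeholder identifiers cannot leak into a real crawl.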
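Other ways of using the frontier
Communication with the frontier can also be done through other mechanisms, such as an HTTP API or a queue system. These features are not available for the time being, but they are intended to be included in future versions.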