Crawling strategy

Use the cluster example and the frontera.worker.strategies.bfs module for reference. In general, you need to write a crawling strategy subclass of:

class frontera.worker.strategies.BaseCrawlingStrategy(manager, mb_stream, states_context)

Interface definition for a crawling strategy.

Before calling these methods, the strategy worker adds a ‘state’ key to the meta field of every Request, holding the state of the URL. Please refer to the HBaseBackend implementation for the possible states.

After any of these methods returns, the states in the meta field are passed back and stored in the backend.
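The round trip described above can be sketched as follows. This is a minimal illustration, not Frontera code: plain dicts stand in for Request objects, the state labels are made-up placeholders rather than the backend's real state constants, and the helper names (attach_states, strategy_add_seeds, store_states) are hypothetical.

```python
# Hypothetical stand-ins: plain dicts instead of Frontera Request objects,
# and placeholder labels instead of the backend's real state constants.
NOT_CRAWLED, QUEUED, CRAWLED = "not_crawled", "queued", "crawled"


def attach_states(requests, stored_states):
    """Worker side, before the strategy call: copy each URL's stored state into meta."""
    for req in requests:
        req["meta"]["state"] = stored_states.get(req["url"], NOT_CRAWLED)


def strategy_add_seeds(seeds):
    """Strategy side: read the state from meta and update it in place."""
    for seed in seeds:
        if seed["meta"]["state"] == NOT_CRAWLED:
            seed["meta"]["state"] = QUEUED


def store_states(requests, stored_states):
    """Worker side, after the strategy call: persist whatever the strategy left in meta."""
    for req in requests:
        stored_states[req["url"]] = req["meta"]["state"]


stored = {"http://example.com/a": CRAWLED}
batch = [
    {"url": "http://example.com/a", "meta": {}},
    {"url": "http://example.com/b", "meta": {}},
]

attach_states(batch, stored)   # states are added to meta
strategy_add_seeds(batch)      # the strategy may change them
store_states(batch, stored)    # states are passed back and stored
```

In this sketch the already crawled URL keeps its state, while the new URL moves from the placeholder "not_crawled" to "queued".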

Methods

classmethod from_worker(manager, mb_stream, states_context)

Called on instantiation in the strategy worker.

Parameters:
  • manager
    FrontierManager instance (frontera.core.manager.FrontierManager)
  • mb_stream
    UpdateScoreStream instance (frontera.worker.strategy.UpdateScoreStream)
Returns: new instance

add_seeds(seeds)

Called when an add_seeds event is received from the spider log.

Parameters: seeds (list) – A list of Request objects.

page_crawled(response)

Called every time a document is successfully crawled and a page_crawled event is received from the spider log.

Parameters: response (object) – The Response object for the crawled page.

page_error(request, error)

Called every time an error occurs while downloading a page.

Parameters:
  • request (object) – The Request object that was fetched with an error.
  • error (str) – A string identifier for the error.

finished()

Called by the strategy worker after it finishes processing each cycle of the spider log. If this method returns true, the strategy worker reports that the crawling goal has been achieved, stops, and exits.

Returns: bool

close()

Called when the strategy worker is about to close the crawling strategy.

This class can be placed in any module and passed to the strategy worker at startup, either through a command-line option or through the CRAWLING_STRATEGY setting.
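As an illustration of the settings route, a settings module might contain a line like the one below. The module path and class name are hypothetical placeholders; the value is assumed to be the dotted path to your strategy class.

```python
# Frontera settings module (the file name and location are up to your project).
# 'myproject.strategy.MyCrawlingStrategy' is a hypothetical dotted path:
# replace it with the module and class where your strategy actually lives.
CRAWLING_STRATEGY = 'myproject.strategy.MyCrawlingStrategy'
```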

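The strategy class is instantiated in the strategy worker and can use its own storage or any other kind of resource. All events from the spider log are passed to these methods. The scores returned do not have to be the same as those in the method arguments. The finished() method is called periodically to check whether the crawling goal has been achieved.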
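The sketch below shows what such a strategy could look like and how it is driven. It is an illustration under stated assumptions, not Frontera code: the BaseCrawlingStrategy defined here is a local stand-in that only mirrors the method names listed above, PageBudgetStrategy and its page_budget parameter are hypothetical, plain dicts replace Request and Response objects, and the loop at the end is a simplified simulation of the strategy worker.

```python
class BaseCrawlingStrategy:
    """Local stand-in that mirrors the documented interface (not the real class)."""

    def __init__(self, manager, mb_stream, states_context):
        self._manager = manager
        self._mb_stream = mb_stream
        self._states_context = states_context

    @classmethod
    def from_worker(cls, manager, mb_stream, states_context):
        return cls(manager, mb_stream, states_context)


class PageBudgetStrategy(BaseCrawlingStrategy):
    """Hypothetical strategy: the crawling goal is a fixed number of crawled pages."""

    def __init__(self, manager, mb_stream, states_context, page_budget=3):
        super().__init__(manager, mb_stream, states_context)
        self.page_budget = page_budget
        self.crawled_urls = set()  # the strategy's own storage
        self.failed = {}           # url -> error identifier
        self.closed = False

    def add_seeds(self, seeds):
        for seed in seeds:
            seed["meta"]["state"] = "queued"  # placeholder state label

    def page_crawled(self, response):
        self.crawled_urls.add(response["url"])

    def page_error(self, request, error):
        self.failed[request["url"]] = error

    def finished(self):
        return len(self.crawled_urls) >= self.page_budget

    def close(self):
        self.closed = True


# Simulated spider log, grouped into processing cycles of (event name, arguments).
spider_log_cycles = [
    [("add_seeds", ([{"url": "http://example.com/", "meta": {}}],))],
    [("page_crawled", ({"url": "http://example.com/"},)),
     ("page_error", ({"url": "http://example.com/x"}, "timeout"))],
    [("page_crawled", ({"url": "http://example.com/1"},)),
     ("page_crawled", ({"url": "http://example.com/2"},))],
    [("page_crawled", ({"url": "http://example.com/3"},))],  # not reached
]

strategy = PageBudgetStrategy.from_worker(None, None, None)
cycles_run = 0
for cycle in spider_log_cycles:
    for name, args in cycle:
        getattr(strategy, name)(*args)  # every event goes to the matching method
    cycles_run += 1
    if strategy.finished():             # polled once per cycle
        break
strategy.close()
```

Keeping the goal check inside finished() means the driving loop needs no knowledge of the strategy's internal storage; it only asks whether to stop after each cycle.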