Backends¶

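The frontier Backend is where the crawling logic and policies live; it is essentially the brain of your crawler. Queue, Metadata and States are the classes meant to hold low-level code, whereas the Backend class operates at a higher level. Frontera ships with in-memory and database implementations of Queue, Metadata and States. They can be used inside your own custom backend class, or used standalone by instantiating FrontierManager and a Backend directly.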

Backend methods are called by the FrontierManager after Middleware, using hooks to process Request and Response objects according to the frontier data flow.

Unlike Middleware, which can have many different instances activated, only one Backend can be used per Frontera frontier.

Activating a backend¶

To activate a Frontera backend component, set it through the BACKEND setting.

Here is an example:

BACKEND = 'frontera.contrib.backends.memory.FIFO'

Keep in mind that some backends may require additional settings to be configured. For more information, see the backends documentation.

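Writing your own backend¶

Each backend component is a Python class that derives from Backend or DistributedBackend and uses one or more of Queue, Metadata and States.

FrontierManager communicates with the active backend through the methods described below.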

class frontera.core.components.Backend¶

Methods

Backend.frontier_start()¶

Called when the frontier starts, see starting/stopping the frontier.

Returns:	None.

Backend.frontier_stop()¶

Called when the frontier stops, see starting/stopping the frontier.

Returns:	None.

Backend.finished()¶

Quick check if crawling is finished. Called pretty often, please make sure calls are lightweight.

Returns:	boolean

Backend.add_seeds(seeds)¶

This method is called when new seeds are added to the frontier.

Parameters:	seeds (list) – A list of `Request` objects.
Returns:	None.

Backend.page_crawled(response)¶

This method is called every time a page has been crawled.

Parameters:	response (object) – The `Response` object for the crawled page.
Returns:	None.

Backend.request_error(page, error)¶

This method is called each time an error occurs when crawling a page.

Parameters:	page (object) – The `Request` object that failed to be crawled. error (string) – A string identifier for the error.
Returns:	None.

Backend.get_next_requests(max_n_requests, **kwargs)¶

Returns a list of next requests to be crawled.

Parameters:	max_n_requests (int) – Maximum number of requests to be returned by this method. kwargs (dict) – Parameters from the downloader component.
Returns:	list of `Request` objects.

Class Methods

Backend.from_manager(manager)¶

Class method called from FrontierManager passing the manager itself.

Example of usage:

@classmethod
def from_manager(cls, manager):
    # Build the backend from the settings held by the manager.
    return cls(settings=manager.settings)

Properties

queue¶

states¶

metadata¶
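
As a rough illustration of how these hooks fit together, the following stand-alone sketch implements a tiny first-in, first-out backend. It deliberately does not import Frontera: `Request`, `Response`, `FakeManager` and `SketchFIFOBackend` are hypothetical stand-ins. A real backend would subclass `frontera.core.components.Backend` and receive Frontera's own request and response objects.

```python
from collections import deque


class Request:
    """Hypothetical stand-in for Frontera's Request object."""
    def __init__(self, url):
        self.url = url
        self.meta = {}


class Response:
    """Hypothetical stand-in for Frontera's Response object."""
    def __init__(self, request, status_code=200):
        self.request = request
        self.url = request.url
        self.status_code = status_code


class FakeManager:
    """Hypothetical stand-in for FrontierManager; it only exposes settings."""
    def __init__(self, settings):
        self.settings = settings


class SketchFIFOBackend:
    """Mirrors the Backend hook names with a first-in, first-out policy."""

    def __init__(self, settings=None):
        self.settings = settings or {}
        self._pending = deque()  # requests waiting to be crawled
        self._seen = set()       # URLs that were already scheduled once
        self.crawled = []        # URLs of successfully crawled pages
        self.errors = {}         # URL -> error identifier

    @classmethod
    def from_manager(cls, manager):
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass  # nothing to open for an in-memory sketch

    def frontier_stop(self):
        self._pending.clear()

    def finished(self):
        # Kept cheap on purpose: this check is called often.
        return not self._pending

    def add_seeds(self, seeds):
        for request in seeds:
            if request.url not in self._seen:
                self._seen.add(request.url)
                self._pending.append(request)

    def page_crawled(self, response):
        self.crawled.append(response.url)

    def request_error(self, page, error):
        self.errors[page.url] = error

    def get_next_requests(self, max_n_requests, **kwargs):
        batch = []
        while self._pending and len(batch) < max_n_requests:
            batch.append(self._pending.popleft())
        return batch


backend = SketchFIFOBackend.from_manager(FakeManager({"note": "hypothetical"}))
backend.add_seeds([
    Request("http://example.com/a"),
    Request("http://example.com/b"),
    Request("http://example.com/a"),  # duplicate, scheduled only once
])
first_batch = backend.get_next_requests(max_n_requests=1)
backend.page_crawled(Response(first_batch[0]))
second_batch = backend.get_next_requests(max_n_requests=5)
backend.request_error(second_batch[0], "timeout")
```

The duplicate seed is dropped, requests come back in insertion order, and `finished()` only inspects the pending queue so that it stays lightweight.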

class frontera.core.components.DistributedBackend¶

Inherits all methods of Backend and adds two class methods, which are called during strategy worker and db worker instantiation.

classmethod DistributedBackend.strategy_worker(manager)¶

classmethod DistributedBackend.db_worker(manager)¶
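
The two class methods act as alternative constructors, one per worker type. The sketch below only shows that pattern. `SketchDistributedBackend`, `FakeManager` and the `role` attribute are hypothetical, and what a real backend sets up for each worker is not specified here.

```python
class FakeManager:
    """Hypothetical stand-in for FrontierManager; it only exposes settings."""
    def __init__(self, settings):
        self.settings = settings


class SketchDistributedBackend:
    """Hypothetical class showing one constructor per worker type."""

    def __init__(self, settings, role):
        self.settings = settings
        self.role = role  # illustrative only: records which worker built us

    @classmethod
    def strategy_worker(cls, manager):
        # Called while a strategy worker is being instantiated.
        return cls(manager.settings, role="strategy_worker")

    @classmethod
    def db_worker(cls, manager):
        # Called while a db worker is being instantiated.
        return cls(manager.settings, role="db_worker")


manager = FakeManager({"note": "hypothetical"})
sw_backend = SketchDistributedBackend.strategy_worker(manager)
db_backend = SketchDistributedBackend.db_worker(manager)
```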

A Backend should communicate with low-level storage through these classes:

Metadata¶

class frontera.core.components.Metadata¶

Methods

Metadata.add_seeds(seeds)¶

This method is called when new seeds are added to the frontier.

Parameters:	seeds (list) – A list of `Request` objects.

Metadata.request_error(page, error)¶

This method is called each time an error occurs when crawling a page.

Parameters:	page (object) – The `Request` object that failed to be crawled. error (string) – A string identifier for the error.

Metadata.page_crawled(response)¶

This method is called every time a page has been crawled.

Parameters:	response (object) – The `Response` object for the crawled page.

Known implementations are: MemoryMetadata and sqlalchemy.components.Metadata.
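
To make the three Metadata hooks concrete, here is a minimal dictionary-backed sketch. It does not use Frontera: `Request`, `Response` and `DictMetadata` are hypothetical stand-ins, and keying records by a `meta["fingerprint"]` value is an assumption made for the example.

```python
class Request:
    """Hypothetical stand-in for Frontera's Request object."""
    def __init__(self, url, fingerprint):
        self.url = url
        self.meta = {"fingerprint": fingerprint}


class Response:
    """Hypothetical stand-in for Frontera's Response object."""
    def __init__(self, request, status_code=200):
        self.request = request
        self.url = request.url
        self.meta = request.meta
        self.status_code = status_code


class DictMetadata:
    """Keeps one record per document fingerprint in a plain dict."""

    def __init__(self):
        self.records = {}

    def _record(self, obj):
        # Create the record on first sight, then reuse it.
        return self.records.setdefault(
            obj.meta["fingerprint"],
            {"url": obj.url, "status_code": None, "error": None},
        )

    def add_seeds(self, seeds):
        for request in seeds:
            self._record(request)

    def page_crawled(self, response):
        self._record(response)["status_code"] = response.status_code

    def request_error(self, page, error):
        self._record(page)["error"] = error


metadata = DictMetadata()
ok = Request("http://example.com/", "fp-ok")
bad = Request("http://example.com/missing", "fp-bad")
metadata.add_seeds([ok, bad])
metadata.page_crawled(Response(ok, status_code=200))
metadata.request_error(bad, "timeout")
```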

Queue¶

class frontera.core.components.Queue¶

Methods

Queue.get_next_requests(max_n_requests, partition_id, **kwargs)¶

Returns a list of next requests to be crawled, and excludes them from internal storage.

Parameters:	max_n_requests (int) – Maximum number of requests to be returned by this method. partition_id – Identifier of the partition to read requests from. kwargs (dict) – Parameters from the downloader component.
Returns:	list of `Request` objects.

Queue.schedule(batch)¶

Schedules new documents for download from the batch and updates the score in metadata.

Parameters:	batch – list of tuples (fingerprint, score, request, schedule). If `schedule` is True, the document needs to be scheduled for download; if False, only the score in metadata is updated.

Queue.count()¶

Returns count of documents in the queue.

Returns:	int

Known implementations are: MemoryQueue and sqlalchemy.components.Queue.
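
The Queue contract can be sketched with the standard `heapq` module. The example below is an assumption-laden illustration rather than Frontera's implementation: `HeapQueue` and `Request` are hypothetical, a higher score is assumed to mean higher priority, `partition_id` is accepted but ignored, and a plain dict stands in for the score kept in metadata.

```python
import heapq
import itertools


class Request:
    """Hypothetical stand-in for Frontera's Request object."""
    def __init__(self, url):
        self.url = url


class HeapQueue:
    """Single-partition priority queue following the Queue method names."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker, keeps insertion order
        self.scores = {}                   # fingerprint -> score (metadata stand-in)

    def schedule(self, batch):
        for fingerprint, score, request, schedule in batch:
            self.scores[fingerprint] = score  # the score is always updated
            if schedule:
                # Negate the score so the highest score is popped first.
                heapq.heappush(self._heap, (-score, next(self._counter), request))

    def get_next_requests(self, max_n_requests, partition_id, **kwargs):
        # partition_id is ignored: this sketch has a single partition.
        results = []
        while self._heap and len(results) < max_n_requests:
            _, _, request = heapq.heappop(self._heap)
            results.append(request)  # popped, so removed from internal storage
        return results

    def count(self):
        return len(self._heap)


queue = HeapQueue()
queue.schedule([
    ("fp1", 0.2, Request("http://example.com/low"), True),
    ("fp2", 0.9, Request("http://example.com/high"), True),
    ("fp3", 0.5, Request("http://example.com/score-only"), False),
])
count_before = queue.count()
top = queue.get_next_requests(1, partition_id=0)
count_after = queue.count()
```

The third tuple has `schedule` set to False, so only its score is recorded and it never enters the heap.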

States¶

class frontera.core.components.States¶

Methods

States.update_cache(objs)¶

Reads states from the meta['state'] field of each request in objs and stores them in the internal cache.

Parameters:	objs – list or tuple of `Request` objects.

States.set_states(objs)¶

Sets the meta['state'] field from the cache for every request in objs.

Parameters:	objs – list or tuple of `Request` objects.

States.flush(force_clear)¶

Flushes internal cache to storage.

Parameters:	force_clear – boolean; True signals that the cache should be cleared after the flush.

States.fetch(fingerprints)¶

Get states from the persistent storage to internal cache.

Parameters:	fingerprints – list of document fingerprints whose states should be read.

Known implementations are: MemoryStates and sqlalchemy.components.States.
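
The interplay between the cache and persistent storage is easier to see in a small sketch. Everything below is hypothetical: `DictStates` uses two dicts in place of a real cache and storage, the state constants are illustrative values, and requests are assumed to carry `meta["fingerprint"]`.

```python
class Request:
    """Hypothetical stand-in for Frontera's Request object."""
    def __init__(self, fingerprint, state=None):
        self.meta = {"fingerprint": fingerprint, "state": state}


class DictStates:
    """Cache plus 'persistent' storage, both simulated with dicts."""

    NOT_CRAWLED, QUEUED, CRAWLED = 0, 1, 2  # illustrative constants

    def __init__(self):
        self._cache = {}
        self._storage = {}

    def update_cache(self, objs):
        # Copy the state carried by each request into the cache.
        for obj in objs:
            self._cache[obj.meta["fingerprint"]] = obj.meta["state"]

    def set_states(self, objs):
        # Write the cached state back onto each request.
        for obj in objs:
            obj.meta["state"] = self._cache.get(
                obj.meta["fingerprint"], self.NOT_CRAWLED
            )

    def flush(self, force_clear):
        self._storage.update(self._cache)
        if force_clear:
            self._cache.clear()

    def fetch(self, fingerprints):
        # Load stored states for the given fingerprints into the cache.
        for fingerprint in fingerprints:
            if fingerprint in self._storage:
                self._cache[fingerprint] = self._storage[fingerprint]


states = DictStates()
states.update_cache([Request("fp1", state=DictStates.QUEUED)])
states.flush(force_clear=True)  # persisted, cache now empty

probe = Request("fp1")
states.set_states([probe])
state_before_fetch = probe.meta["state"]

states.fetch(["fp1"])
states.set_states([probe])
state_after_fetch = probe.meta["state"]
```

Because the cache was cleared by `flush`, the first `set_states` call falls back to the default state; after `fetch` the stored state is visible again.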

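Built-in backend reference¶

This section describes all the backend components that are bundled with Frontera.

To find out which Backend is activated by default, see the BACKEND setting.

Basic algorithms¶

Some of the built-in Backend objects implement basic algorithms, such as FIFO/LIFO or DFS/BFS, for page visit ordering.

The difference between them lies in the storage engine used. For example, memory.FIFO and sqlalchemy.FIFO use the same logic but different storage engines.

All these backend variations use the same CommonBackend class, which implements a one-time visit crawling policy with a priority queue.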

class frontera.contrib.backends.CommonBackend¶
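
One way to picture how a single priority queue can yield all four orderings is to vary only the priority assigned to each page. The sketch below assumes such a scheme (insertion index for FIFO/LIFO, link depth for BFS/DFS); the function `crawl_order` and the scoring rules are illustrative and are not taken from Frontera's source. It orders a fixed set of pages rather than simulating a live crawl.

```python
import heapq


def crawl_order(pages, algorithm):
    """Return URLs in visiting order.

    pages is a list of (url, depth) tuples in discovery order.
    """
    def priority(index, depth):
        if algorithm == "FIFO":
            return index    # oldest first
        if algorithm == "LIFO":
            return -index   # newest first
        if algorithm == "BFS":
            return depth    # shallowest first
        if algorithm == "DFS":
            return -depth   # deepest first
        raise ValueError("unknown algorithm: %s" % algorithm)

    # The index is also used as a tie-breaker between equal priorities.
    heap = [(priority(i, depth), i, url) for i, (url, depth) in enumerate(pages)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


pages = [("seed", 0), ("a", 1), ("a1", 2), ("b", 1)]
orders = {name: crawl_order(pages, name) for name in ("FIFO", "LIFO", "BFS", "DFS")}
```

With the same input, only the priority function changes, which mirrors the idea that the variants share their logic and differ elsewhere.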

Memory backends¶

This set of Backend objects uses the heapq module as the queue and native dictionaries as storage for the basic algorithms.

class frontera.contrib.backends.memory.BASE¶: Base class for in-memory Backend objects.

class frontera.contrib.backends.memory.FIFO¶: In-memory Backend implementation of FIFO algorithm.

class frontera.contrib.backends.memory.LIFO¶: In-memory Backend implementation of LIFO algorithm.

class frontera.contrib.backends.memory.BFS¶: In-memory Backend implementation of BFS algorithm.

class frontera.contrib.backends.memory.DFS¶: In-memory Backend implementation of DFS algorithm.

class frontera.contrib.backends.memory.RANDOM¶: In-memory Backend implementation of a random selection algorithm.

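SQLAlchemy backends¶

This set of Backend objects uses SQLAlchemy as storage for the basic algorithms.

By default it uses an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.

If you want to use your own declarative SQLAlchemy models, you can do so with the SQLALCHEMYBACKEND_MODELS setting.

This setting takes a dictionary in which each key is the name of the model to define and each value is the model to use.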
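
A settings fragment along these lines could override a single model. The key names and the two `frontera.contrib.backends.sqlalchemy.models` paths are believed to match the defaults but should be checked against the settings reference; `myproject.models.MyMetadataModel` is a hypothetical custom model.

```python
# Settings module fragment (a sketch; verify key names against the settings reference).
SQLALCHEMYBACKEND_MODELS = {
    # Hypothetical custom declarative model replacing the metadata model.
    'MetadataModel': 'myproject.models.MyMetadataModel',
    # Paths assumed to be the bundled models.
    'StateModel': 'frontera.contrib.backends.sqlalchemy.models.StateModel',
    'QueueModel': 'frontera.contrib.backends.sqlalchemy.models.QueueModel',
}
```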

For all the settings used by the SQLAlchemy backends, see the settings reference.

class frontera.contrib.backends.sqlalchemy.BASE¶: Base class for SQLAlchemy Backend objects.

class frontera.contrib.backends.sqlalchemy.FIFO¶: SQLAlchemy Backend implementation of FIFO algorithm.

class frontera.contrib.backends.sqlalchemy.LIFO¶: SQLAlchemy Backend implementation of LIFO algorithm.

class frontera.contrib.backends.sqlalchemy.BFS¶: SQLAlchemy Backend implementation of BFS algorithm.

class frontera.contrib.backends.sqlalchemy.DFS¶: SQLAlchemy Backend implementation of DFS algorithm.

class frontera.contrib.backends.sqlalchemy.RANDOM¶: SQLAlchemy Backend implementation of a random selection algorithm.

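Revisiting backend¶

Based on a custom SQLAlchemy backend and queue. Crawling starts with the seeds. After the seeds are crawled, every newly discovered document is scheduled for immediate crawling. Once a document has been crawled, it is scheduled to be crawled again after the interval set by SQLALCHEMYBACKEND_REVISIT_INTERVAL.

The revisiting backend currently does not implement prioritization. During long runs the spider may go idle because no tasks are due for crawling, even though tasks are still waiting for their scheduled revisit time.

class frontera.contrib.backends.sqlalchemy.revisiting.Backend¶: Base class for the SQLAlchemy Backend implementation of the revisiting backend.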
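
The scheduling behavior described above, including the idle periods, can be sketched with a heap ordered by due time. `RevisitQueue` is a hypothetical in-memory illustration, not the SQLAlchemy implementation, and it assumes the revisit interval is expressed as a `timedelta`.

```python
import heapq
from datetime import datetime, timedelta


class RevisitQueue:
    """Hands out URLs only once their scheduled time has arrived."""

    def __init__(self, revisit_interval):
        self.revisit_interval = revisit_interval
        self._heap = []  # entries are (due_time, sequence, url)
        self._sequence = 0

    def schedule(self, url, due):
        heapq.heappush(self._heap, (due, self._sequence, url))
        self._sequence += 1

    def page_crawled(self, url, now):
        # A crawled document is queued again one interval later.
        self.schedule(url, now + self.revisit_interval)

    def get_next_requests(self, max_n_requests, now):
        due = []
        while (self._heap and len(due) < max_n_requests
               and self._heap[0][0] <= now):
            due.append(heapq.heappop(self._heap)[2])
        return due


start = datetime(2020, 1, 1)
queue = RevisitQueue(revisit_interval=timedelta(days=1))
queue.schedule("http://example.com/", due=start)  # seeds are due immediately

first = queue.get_next_requests(10, now=start)
queue.page_crawled(first[0], now=start)

idle = queue.get_next_requests(10, now=start + timedelta(hours=1))
later = queue.get_next_requests(10, now=start + timedelta(days=1))
```

One hour after the first crawl the queue is not empty, yet nothing is due, which is the idle situation mentioned in the text.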

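HBase backend¶

class frontera.contrib.backends.hbase.HBaseBackend¶

Better suited to large-scale crawls. For its settings, see the HBase backend settings reference. Consider tuning the block cache to suit the average size of a block of web pages. To achieve this, it is recommended to use hostname_local_fingerprint, which keeps pages from the same host close together. The function is selected through the URL_FINGERPRINT_FUNCTION setting.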
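
The idea behind a host-local fingerprint is that the key starts with a value derived from the host name, so documents from one host sort next to each other in storage. The function below is an illustrative sketch of that idea using only the standard library; `host_prefixed_fingerprint` is a hypothetical name and this is not Frontera's `hostname_local_fingerprint` implementation.

```python
import hashlib
import zlib
from urllib.parse import urlparse


def host_prefixed_fingerprint(url):
    """Return a 40-character hex key: 8 chars for the host, 32 for the rest."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").encode("utf-8")
    # Same host -> same 8-character prefix.
    host_part = format(zlib.crc32(host) & 0xFFFFFFFF, "08x")
    # The remainder distinguishes documents within the host.
    rest = (parsed.path + ";" + parsed.params + "?" + parsed.query).encode("utf-8")
    doc_part = hashlib.md5(rest).hexdigest()
    return host_part + doc_part


fp_a = host_prefixed_fingerprint("http://example.com/a")
fp_b = host_prefixed_fingerprint("http://example.com/b?page=2")
fp_other = host_prefixed_fingerprint("http://example.org/a")
```

Sorting such keys groups all documents of a host under a shared prefix, which is what lets a block-oriented store keep them physically close.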