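Crawler4U - crawlerclub/crawler: Getting Started with an Excellent Golang Crawler Library

Crawler4U describes itself in one line: "Ten years sharpening one sword - Crawler4U focuses on general-purpose crawling." That caught my attention right away. The documentation is sparse and I was about to move on, but then I looked at who uses this crawler.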


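I was stunned! So I read the code and found it highly professional 👍 I wish I had discovered it sooner 😄

The next time I need a crawler, I will definitely choose Crawler4U.

Getting Started

Download or Build

Download a prebuilt binary, or download the source and build it yourself.

Bash: build crawler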

go get -d crawler.club/crawler
cd $GOPATH/src/crawler.club/crawler
make

Configuration

The default configuration file is conf/seeds.json. It lists the target pages (the seeds) to crawl, for example:

JSON: seeds.json

[
  {
    "parser_name": "section",
    "url": "http://www.newsmth.net/nForum/section/1"
  },
  {
    "parser_name": "section",
    "url": "http://www.newsmth.net/nForum/section/2"
  }
]
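
To make the shape of seeds.json concrete, here is a minimal Go sketch that decodes such a file and sanity-checks each entry. The Seed type and the parseSeeds function are hypothetical names introduced for illustration; they are not taken from Crawler4U's source, and the validation rules are an assumption about what a sensible seed looks like.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// Seed mirrors one entry of conf/seeds.json as shown above.
// The type name is an assumption made for this sketch.
type Seed struct {
	ParserName string `json:"parser_name"`
	URL        string `json:"url"`
}

// parseSeeds decodes a seeds.json document and rejects entries
// that lack a parser name or an absolute URL.
func parseSeeds(data []byte) ([]Seed, error) {
	var seeds []Seed
	if err := json.Unmarshal(data, &seeds); err != nil {
		return nil, fmt.Errorf("decode seeds: %w", err)
	}
	for i, s := range seeds {
		if strings.TrimSpace(s.ParserName) == "" {
			return nil, fmt.Errorf("seed %d: empty parser_name", i)
		}
		u, err := url.Parse(s.URL)
		if err != nil || u.Scheme == "" || u.Host == "" {
			return nil, fmt.Errorf("seed %d: invalid url %q", i, s.URL)
		}
	}
	return seeds, nil
}

func main() {
	data := []byte(`[
  {"parser_name": "section", "url": "http://www.newsmth.net/nForum/section/1"},
  {"parser_name": "section", "url": "http://www.newsmth.net/nForum/section/2"}
]`)
	seeds, err := parseSeeds(data)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range seeds {
		fmt.Printf("%s -> %s\n", s.ParserName, s.URL)
	}
}
```

A check like this can be run before starting the crawler to catch a typo in a seed entry early.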

parser_name selects the parser file to use; the file name matches the configured value. For example:

JSON: section.json

{
  "name": "section",
  "example_url": "http://www.newsmth.net/nForum/section/1",
  "default_fields": true,
  "rules": {
    "root": [
      {
        "type": "url",
        "key": "section",
        "xpath": "//tr[contains(td[2]/text(),'[二级目录]')]/td[1]/a"
      },
      {
        "type": "url",
        "key": "board",
        "xpath": "//tr[not(contains(td[2]/text(),'[二级目录]'))]/td[1]/a"
      }
    ]
  },
  "js": ""
}

The key of each url rule under rules["root"] also names a parser file: section points back to this same file, and board is defined as follows:

JSON: board.json

{
  "name": "board",
  "example_url": "http://www.newsmth.net/nForum/board/Universal",
  "default_fields": true,
  "rules": {
    "root": [
      {
        "type": "url",
        "key": "article",
        "xpath": "//tr[not(contains(@class, 'top ad'))]/td[2]/a"
      },
      {
        "type": "url",
        "key": "board",
        "xpath": "//div[@class='t-pre']//li[@class='page-select']/following-sibling::li[1]/a"
      },
      {
        "type": "text",
        "key": "time_",
        "xpath": "//tr[not(contains(@class, 'top'))][1]/td[8]"
      }
    ]
  },
  "js": ""
}

The file above in turn references article.json:

JSON: article.json

{
  "name": "article",
  "example_url": "http://www.newsmth.net/nForum/article/AI/65703",
  "default_fields": true,
  "rules": {
    "root": [
      {
        "type": "url",
        "key": "article",
        "xpath": "//div[@class='t-pre']//li/a/@href"
      },
      {
        "type": "dom",
        "key": "posts",
        "xpath": "//table[contains(concat(' ', @class, ' '), ' article ')]"
      }
    ],
    "posts": [
      {
        "type": "text",
        "key": "text",
        "xpath": ".//td[contains(concat(' ', @class, ' '), ' a-content ')]"
      },
      {
        "type": "html",
        "key": "meta",
        "xpath": ".//td[contains(concat(' ', @class, ' '), ' a-content ')]",
        "re": [
          "发信人:(?P<author>.+?)\\((?P<nick>.*?)\\).*?信区:(?P<board>.+?)<br/>",
          "标题:(?P<title>.+?)<br/>",
          "发信站:(?P<site>.+?)\\((?P<time>.+?)\\)",
          "\\[FROM: (?P<ip>[\\d\\.\\*]+?)\\]"
        ]
      },
      {
        "type": "text",
        "key": "floor",
        "xpath": ".//span[contains(@class, 'a-pos')]",
        "re": ["(\\d+|楼主)"],
        "js": "function process(s){if(s=='楼主') return '0'; return s;}"
      }
    ]
  },
  "js": ""
}

And so on.
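
The way these files reference one another (seeds.json names section, whose url rules name section and board, and so on) can be pictured as a small graph walk. The following Go sketch is illustrative only: the Parser and Rule types and the reachable function are hypothetical, and it assumes that the key of a rule with type url names the parser applied to the extracted links, as described above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Rule and Parser model only the fields needed for this sketch;
// other fields in the parser files (re, js, ...) are ignored on decode.
type Rule struct {
	Type  string `json:"type"`
	Key   string `json:"key"`
	XPath string `json:"xpath"`
}

type Parser struct {
	Name  string            `json:"name"`
	Rules map[string][]Rule `json:"rules"`
}

// reachable returns the sorted names of all parsers that can be reached
// from start by following url rules. A referenced name is reported even
// if no parser with that name is present in the map.
func reachable(parsers map[string]Parser, start string) []string {
	seen := map[string]bool{}
	queue := []string{start}
	for len(queue) > 0 {
		name := queue[0]
		queue = queue[1:]
		if seen[name] {
			continue
		}
		seen[name] = true
		for _, rules := range parsers[name].Rules {
			for _, r := range rules {
				if r.Type == "url" && !seen[r.Key] {
					queue = append(queue, r.Key)
				}
			}
		}
	}
	names := make([]string, 0, len(seen))
	for n := range seen {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Abbreviated versions of the three parser files shown above (xpath omitted).
	files := []string{
		`{"name":"section","rules":{"root":[{"type":"url","key":"section"},{"type":"url","key":"board"}]}}`,
		`{"name":"board","rules":{"root":[{"type":"url","key":"article"},{"type":"url","key":"board"},{"type":"text","key":"time_"}]}}`,
		`{"name":"article","rules":{"root":[{"type":"url","key":"article"},{"type":"dom","key":"posts"}],"posts":[{"type":"text","key":"text"}]}}`,
	}
	parsers := map[string]Parser{}
	for _, raw := range files {
		var p Parser
		if err := json.Unmarshal([]byte(raw), &p); err != nil {
			fmt.Println("error:", err)
			return
		}
		parsers[p.Name] = p
	}
	fmt.Println(reachable(parsers, "section"))
}
```

Starting from the section seed, the walk visits section, board and article, which matches the order in which the files were introduced above.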

Running

Bash: run crawler

% ./crawler -logtostderr -api -period 30
Git SHA: Not provided (use make instead of go build)
Go Version: go1.17.1
Go OS/Arch: darwin/amd64
I1102 23:48:54.650103   79334 main.go:133] start worker 0
I1102 23:48:54.650327   79334 web.go:89] rest server listen on:2001

Crawled content is saved under data/fs by default, in JSON format, one record per line.

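Results are stored in a generic Doc structure that holds a lot of information.

[Figure: saved Doc information]

Runtime Status

Check the runtime status at http://localhost:2001/api/status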

JSON: status

{
	"status": "OK",
	"message": {
		"crawl": {
			"queue_length": 80527,
			"retry_queue_length": 28
		},
		"store": {
			"queue_length": 36880,
			"retry_queue_length": 0
		}
	}
}
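
For monitoring, the status endpoint can also be polled from Go. The sketch below models only the fields visible in the sample response above; the Status and QueueStats types and both functions are hypothetical, and the response shape for error cases is not known from this article, so it is not handled.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// QueueStats and Status mirror the sample status response shown above.
type QueueStats struct {
	QueueLength      int `json:"queue_length"`
	RetryQueueLength int `json:"retry_queue_length"`
}

type Status struct {
	Status  string `json:"status"`
	Message struct {
		Crawl QueueStats `json:"crawl"`
		Store QueueStats `json:"store"`
	} `json:"message"`
}

// decodeStatus parses a status response body.
func decodeStatus(data []byte) (Status, error) {
	var s Status
	err := json.Unmarshal(data, &s)
	return s, err
}

// fetchStatus requests the status endpoint and decodes the reply.
func fetchStatus(url string) (Status, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		return Status{}, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return Status{}, fmt.Errorf("unexpected HTTP status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return Status{}, err
	}
	return decodeStatus(body)
}

func main() {
	s, err := fetchStatus("http://localhost:2001/api/status")
	if err != nil {
		fmt.Println("crawler not reachable:", err)
		return
	}
	fmt.Printf("status=%s crawl queue=%d (retry %d) store queue=%d (retry %d)\n",
		s.Status,
		s.Message.Crawl.QueueLength, s.Message.Crawl.RetryQueueLength,
		s.Message.Store.QueueLength, s.Message.Store.RetryQueueLength)
}
```

Watching the crawl and store queue lengths over time gives a rough idea of whether the crawler is keeping up.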

Saved Data

Here is the format of the saved data, shown with two records taken from data/fs/*.dat:

JSON: crawl results

{
    "article": [
        "http://www.newsmth.net/nForum/article/Mj/158582",
        "http://www.newsmth.net/nForum/article/Mj/158579",
        "http://www.newsmth.net/nForum/article/Mj/158578",
        "http://www.newsmth.net/nForum/article/Mj/158448"
    ],
    "board": "http://www.newsmth.net/nForum/board/Mj?p=3",
    "crawl_time_": "2021-11-02T15:52:07Z",
    "from_parser_": "board",
    "from_url_": "http://www.newsmth.net/nForum/board/Mj?p=2"
}

{
	"crawl_time_": "2021-11-02T15:52:06Z",
	"from_parser_": "article",
	"from_url_": "http://www.newsmth.net/nForum/article/Mj/158590",
	"posts": [{
		"floor": "0",
		"meta": {
			"author": "SolomonIre",
			"board": "Mj",
			"ip": "58.244.39.*",
			"nick": "水龙吟",
			"site": "水木社区",
			"time": "Sun Apr 13 19:58:19 2014",
			"title": "今天回了趟老家累死了"
		},
		"text": "发信人: SolomonIre (水龙吟), 信区: Mj\n标  题: 今天回了趟老家累死了\n发信站: 水木社区 (Sun Apr 13 19:58:19 2014), 站内\n\n搬了些东西，气喘吁吁，再不动弹动弹身体就废了\n\n--\n“似花还似非花，也无人惜从教坠。抛家傍路，思量却是，无情有思。\n萦损柔肠，困酣娇眼，欲开还闭。梦随风万里，寻郎去处，又还被，莺呼起。\n不恨此花飞尽，恨西园、落红难缀。晓来雨过，遗踪何在？一池萍碎。\n春色三分，二分尘土，一分流水。细看来、不是杨花，点点是离人泪。”\n\n—— 苏轼\n\n\n※ 来源:·水木社区 newsmth.net·[FROM: 58.244.39.*]"
	}, {
		"floor": "1",
		"meta": {
			"author": "xiaoyuer",
			"board": "Mj",
			"ip": "123.118.218.*",
			"nick": "十年又十年又十年",
			"site": "水木社区",
			"time": "Sun Apr 13 21:03:56 2014",
			"title": "Re: 今天回了趟老家累死了"
		},
		"text": "发信人: xiaoyuer (十年又十年又十年), 信区: Mj\n标  题: Re: 今天回了趟老家累死了\n发信站: 水木社区 (Sun Apr 13 21:03:56 2014), 站内\n\n\n速度动弹\n\n【 在 SolomonIre (水龙吟) 的大作中提到: 】\n: 搬了些东西，气喘吁吁，再不动弹动弹身体就废了\n\n\n--\n\n※ 来源:·水木社区 newsmth.net·[FROM: 123.118.218.*]"
	}, {
		"floor": "2",
		"meta": {
			"author": "ripeapple",
			"board": "Mj",
			"ip": "123.122.61.*",
			"nick": "小象※自作自受",
			"site": "水木社区",
			"time": "Sun Apr 13 23:16:28 2014",
			"title": "Re: 今天回了趟老家累死了"
		},
		"text": "发信人: ripeapple (小象※自作自受), 信区: Mj\n标  题: Re: 今天回了趟老家累死了\n发信站: 水木社区 (Sun Apr 13 23:16:28 2014), 站内\n\n老家是哪里呀？\n我现在随便动动就很累。。。\n【 在 SolomonIre 的大作中提到: 】\n: 搬了些东西，气喘吁吁，再不动弹动弹身体就废了\n\n--\n\n※ 来源:·水木社区 http://www.newsmth.net·[FROM: 123.122.61.*]"
	}]
}

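Here "from_parser_": "article" indicates that the record was produced by parsing with the article.json configuration file. Looking back at board.json and article.json above makes the parsing approach clear. The JSON files under conf/parser/*.json define how pages are parsed and also determine the data format of the crawl results, which makes the parsers easy to manage.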
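
The meta object in the article record gets its field names from the named capture groups in the re patterns of article.json. Go's regexp package supports the same (?P<name>...) syntax, so the idea can be sketched as follows. The extractNamed helper is hypothetical and is not Crawler4U's implementation; trimming whitespace from each captured value is an assumption based on the trimmed values in the sample output.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// extractNamed applies each pattern to s and collects the named capture
// groups of the first match into a map. Patterns that do not match are
// skipped. Captured values are trimmed of surrounding whitespace.
func extractNamed(patterns []string, s string) (map[string]string, error) {
	out := make(map[string]string)
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("compile %q: %w", p, err)
		}
		m := re.FindStringSubmatch(s)
		if m == nil {
			continue
		}
		for i, name := range re.SubexpNames() {
			if i == 0 || name == "" {
				continue // skip the whole match and unnamed groups
			}
			out[name] = strings.TrimSpace(m[i])
		}
	}
	return out, nil
}

func main() {
	// Two of the patterns from article.json, written as Go raw strings.
	patterns := []string{
		`发信站:(?P<site>.+?)\((?P<time>.+?)\)`,
		`\[FROM: (?P<ip>[\d\.\*]+?)\]`,
	}
	// A shortened stand-in for the HTML of one post.
	html := "发信站: 水木社区 (Sun Apr 13 19:58:19 2014), 站内<br/>※ 来源:·水木社区 newsmth.net·[FROM: 58.244.39.*]"

	meta, err := extractNamed(patterns, html)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(meta["site"], "|", meta["time"], "|", meta["ip"])
}
```

Each group name (site, time, ip) becomes a key of the resulting map, in the same way the keys of meta appear in the saved record.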

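Advanced Usage

Using Cookies

To send your own cookies with requests, you have to modify the work function in main.go.

[Figure: request with cookies]

If you crawl several sites at the same time, build a cookie dictionary.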

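Custom Result Handling

You can modify this part of the code directly:

[Figure: saving results]

This is not recommended, however, because it may reduce crawl throughput. It is better to watch the data/fs/*.dat files and process them separately.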

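Summary

Crawler4U uses goleveldb, an embedded database, to store crawler state, and saves crawl results as JSON files. Deployment is simple: no separate database has to be set up, so the program is self-contained and lightweight.

There is no web configuration interface, which makes it somewhat cumbersome for beginners, but it feels great once you get the hang of it.

References

https://github.com/crawlerclub/crawler

Article URL: https://golangnote.com/topic/295.html Please credit the source when reposting.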

There are 2 comments on "Crawler4U - crawlerclub/crawler: Getting Started with an Excellent Golang Crawler Library"

1 pykill8 says:
2021-11-06 14:21:43

Does it support dynamic rendering?

2 GolangNote says:
2021-11-07 23:13:54

@pykill8 #1 No.
