Web Scraping in Go

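I used to write my scrapers in Python, but recently found that Go is quite convenient for this as well. Here is a brief introduction.
The scraper discussed here is not a crawler that loops endlessly over a large number of resources on the web. It is just a program that pulls specific information out of certain pages. The work splits into two parts: fetching the page and parsing it.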

Fetching the page. This usually means sending an HTTP GET or POST request to the server and reading the response. Go's net/http package handles this well.

GET:

resp, err := http.Get("http://www.legendtkl.com")
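The one-liner above only returns the response. A minimal sketch of a complete fetch, including closing the body and checking the status code, might look like the following. The fetch and newTestServer helpers are names I made up for this illustration, and the local httptest server only stands in for a real site so the example runs without network access.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
)

// fetch downloads url and returns the response body as a string.
// (Hypothetical helper, not part of net/http.)
func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	// Always close the body so the underlying connection can be reused.
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// newTestServer starts a throwaway local server that serves a tiny page.
func newTestServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "<html><body>hello</body></html>")
	}))
}

func main() {
	srv := newTestServer()
	defer srv.Close()

	page, err := fetch(srv.URL)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(page)
}
```

Against a real site you would pass its URL to fetch instead of srv.URL.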

POST:

resp, err := http.Post("http://example.com/upload", "image/jpeg", &buf)
resp, err := http.PostForm("http://example.com/form", url.Values{"key": {"Value"}, "id": {"123"}})
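To see what PostForm actually sends, here is a small sketch that posts the same key and id fields to a local test server and prints what the server received. The echoHandler, postForm and demo names are my own, and the httptest server is an assumption that replaces the example.com endpoint.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// echoHandler replies with the "key" and "id" form fields it received.
func echoHandler(w http.ResponseWriter, r *http.Request) {
	// FormValue parses the URL-encoded request body on first use.
	fmt.Fprintf(w, "key=%s id=%s", r.FormValue("key"), r.FormValue("id"))
}

// postForm submits form to target and returns the server's reply.
// (Hypothetical helper wrapping http.PostForm.)
func postForm(target string, form url.Values) (string, error) {
	resp, err := http.PostForm(target, form)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// demo runs the whole round trip against a local test server.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(echoHandler))
	defer srv.Close()
	return postForm(srv.URL, url.Values{"key": {"Value"}, "id": {"123"}})
}

func main() {
	reply, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(reply)
}
```

PostForm sets the Content-Type to application/x-www-form-urlencoded for you, which is why the handler can read the fields with FormValue.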

If you need finer control over the HTTP client, you can create a Client yourself.

client := &http.Client{
    CheckRedirect: redirectPolicyFunc,
}

The Client struct looks like this:

type Client struct {
    Transport RoundTripper // the mechanism by which individual HTTP requests are made
    CheckRedirect func(req *Request, via []*Request) error // redirect policy; if nil, the default stops after 10 consecutive requests
    Jar CookieJar // the cookie jar
    Timeout time.Duration // time limit for each request
}
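As an illustration of these fields, the sketch below builds a client with a timeout and a redirect policy that refuses to follow redirects. The noRedirect function plays the role of redirectPolicyFunc; it, newClient and demo are hypothetical names, and the 5 second timeout is an arbitrary choice.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"time"
)

// noRedirect makes the client hand back the 3xx response itself
// instead of following it.
func noRedirect(req *http.Request, via []*http.Request) error {
	return http.ErrUseLastResponse
}

// newClient builds a client with a custom redirect policy and a timeout.
func newClient() *http.Client {
	return &http.Client{
		CheckRedirect: noRedirect,
		Timeout:       5 * time.Second, // arbitrary value for this sketch
	}
}

// demo requests a URL that redirects and reports what the client saw.
func demo() (int, string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		http.Redirect(w, r, "/elsewhere", http.StatusFound)
	}))
	defer srv.Close()

	resp, err := newClient().Get(srv.URL + "/start")
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	return resp.StatusCode, resp.Header.Get("Location"), nil
}

func main() {
	code, location, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(code, location)
}
```

With the default policy the client would have followed the redirect; here it stops at the 302 so a scraper can inspect the Location header itself.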

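HTTP headers are the key-value pairs sent along with a request (and returned with a response). The type is defined as follows:

type Header map[string][]string

It provides operations such as Add, Del, and Set.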
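A short sketch of how these operations behave may help. Add appends a value, Set replaces all values for a key, and Del removes the key. The buildHeader function and the header values are made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// buildHeader shows the difference between Set, Add and Del.
func buildHeader() http.Header {
	h := http.Header{}
	h.Set("User-Agent", "my-scraper/0.1") // hypothetical user agent string
	h.Add("Accept", "text/html")
	h.Add("Accept", "application/xhtml+xml") // Add keeps the earlier value
	h.Set("X-Debug", "1")
	h.Del("X-Debug") // removes the key entirely
	return h
}

func main() {
	h := buildHeader()
	// Get returns the first value; the key is canonicalized, so case does not matter.
	fmt.Println(h.Get("accept"))
	// Values returns every value stored under the key.
	fmt.Println(h.Values("Accept"))
	// A deleted key reads back as the empty string.
	fmt.Println(h.Get("X-Debug") == "")
}
```

Because Header is just a map of string slices, one key can carry several values, which is why both Add and Set exist.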

HTTP request. Earlier we called Get directly on the server address. For more control, build a request yourself and send it with Do().

func (c *Client) Do(req *Request) (*Response, error)

client := &http.Client{
...
}
req, err := http.NewRequest("GET", "http://example.com", nil)
req.Header.Add("If-None-Match", `W/"wyzzy"`)
resp, err := client.Do(req)
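To show why setting a header through Do is useful, here is a sketch of a conditional GET. The local test server answers 304 Not Modified when the If-None-Match header carries the current tag, so a scraper can skip pages that have not changed. The etagHandler, conditionalGet and demo names are my own, and the server is an assumption standing in for example.com.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

const etag = `W/"wyzzy"`

// etagHandler answers 304 when the client already has the current version.
func etagHandler(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("If-None-Match") == etag {
		w.WriteHeader(http.StatusNotModified)
		return
	}
	w.Header().Set("ETag", etag)
	fmt.Fprint(w, "full page")
}

// conditionalGet sends a GET, optionally with If-None-Match,
// and returns the status code.
func conditionalGet(client *http.Client, url, tag string) (int, error) {
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return 0, err
	}
	if tag != "" {
		req.Header.Add("If-None-Match", tag)
	}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

// demo fetches the page once without a tag and once with it.
func demo() (int, int, error) {
	srv := httptest.NewServer(http.HandlerFunc(etagHandler))
	defer srv.Close()

	client := &http.Client{}
	first, err := conditionalGet(client, srv.URL, "")
	if err != nil {
		return 0, 0, err
	}
	second, err := conditionalGet(client, srv.URL, etag)
	if err != nil {
		return 0, 0, err
	}
	return first, second, nil
}

func main() {
	first, second, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(first, second)
}
```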

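The fields inside Request belong to HTTP itself, so I won't go into them here.

Everything above only gets us the page content. The more important part is parsing it. For that we will use goquery, which is similar to jQuery and makes it easy to parse HTML. Let's start with an example that scrapes Qiushibaike.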

package main

import (
    "fmt"
    "github.com/PuerkitoBio/goquery"
    "log"
)

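// ExampleScrape prints the text of every post that has no image or video.
func ExampleScrape() {
    // NewDocument fetches the URL and parses the response body as HTML.
    doc, err := goquery.NewDocument("http://www.qiushibaike.com")
    if err != nil {
        log.Fatal(err)
    }
    // Each post on the page is an element with class "article".
    doc.Find(".article").Each(func(i int, s *goquery.Selection) {
        // Skip posts that contain a thumbnail or a video holder.
        if s.Find(".thumb").Length() == 0 && s.Find(".video_holder").Length() == 0 {
            content := s.Find(".content").Text()
            fmt.Printf("%s", content)
        }
    })
}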

func main() {
    ExampleScrape()
}

Running the program prints the text of each text-only post to the terminal.

The core of goquery is the Find method, whose prototype is:

func (s *Selection) Find(selector string) *Selection

The Selection it returns is defined as follows:

type Selection struct {
    Nodes []*html.Node
}

The html package here is golang.org/x/net/html. Its Node struct is:

type Node struct {
    Parent, FirstChild, LastChild, PrevSibling, NextSibling *Node
    Type NodeType
    DataAtom atom.Atom
    Data string
    Namespace string
    Attr []Attribute
}

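goquery uses these Node values to represent the parsed HTML page. The second if in the scraping code above (the one inside Each) filters out the video and image posts on Qiushibaike.
goquery has a rich API that makes parsing convenient. You can look up the details in its godoc.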