Learning to Write Crawlers in Python 3 (Part 1)

2019-11-06 07:15:30

Source: reposted

Contributed by: a site user

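I have recently been learning to write web crawlers in Python.

Most of the books I bought and the material online are still written for Python 2.

Some of the libraries have changed in Python 3, so I am noting the differences here on the blog.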

Crawler, version 1

import urllib.request


def download(url):
    # Fetch the page and return the raw response body as bytes.
    return urllib.request.urlopen(url).read()


txt = download('https://www.baidu.com')
print(txt.decode())  # decode() defaults to 'utf-8'

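Since this is the first version, it does very little.
It fetches the page directly with urlopen() from urllib.request; here the target is baidu.
The bytes it returns are then decoded into text.
Two points to note:
1. Python 2's urllib2 was merged into Python 3's urllib package: its request functions now live in urllib.request and its exceptions in urllib.error.
2. The code above uses the default decoding, which only works for pages encoded in UTF-8. A page that uses a national-standard encoding such as gb2312 will probably fail with a UnicodeDecodeError.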
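
To illustrate the second point without touching the network, here is a minimal sketch that tries UTF-8 first and then falls back to a Chinese national-standard encoding. The helper name decode_with_fallback and the choice of gb18030 (a superset of gb2312) as the fallback are assumptions made for this example, not part of the original code.

```python
def decode_with_fallback(data, encodings=("utf-8", "gb18030")):
    """Try each encoding in order and return the first successful decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding, replacing bad bytes.
    return data.decode(encodings[0], errors="replace")


# Bytes as a gb2312-encoded page would deliver them.
gb_bytes = "中文网页".encode("gb2312")
print(decode_with_fallback(gb_bytes))
```

Trying UTF-8 first is a reasonable default because gb2312 bytes for Chinese text are rarely valid UTF-8, so the first attempt fails quickly and the fallback takes over.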
Version 2
import urllib.error
import urllib.request


def download(url, num_retries=2):
    print('downloading: ', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        print('download error: ', e.reason)
        html = None
        # Retry only on 5xx responses; HTTPError carries the status in e.code.
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, num_retries - 1)
    return html


page = download('http://httpstat.us/500')
if page is not None:
    print(page.decode())
else:
    print('Receive None')

Version 2 adds exception handling on top of version 1.
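The most common error when requesting a page is 404, which means the page does not currently exist.
4xx errors occur when there is a problem with the request.
5xx errors occur when there is a problem on the server side.
A 5xx error may be temporary, so it is reasonable to handle it by retrying the download.
With num_retries=2, the code above retries a 5xx error at most twice, so it makes three attempts in total and gives up if all of them fail.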
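The example URL, http://httpstat.us/500, is a test endpoint meant to answer with status 500, so the request is expected to fail every time.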
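
The retry rule can be checked without a live server. The sketch below is an iterative variant of the same idea, with the fetching step passed in as a function so that a fake server can stand in for the network. The names download_with_retries, always_500 and always_404 are hypothetical and exist only for this illustration.

```python
import urllib.error


def download_with_retries(fetch, url, num_retries=2):
    """Call fetch(url); retry only on HTTP 5xx, at most num_retries times.

    Returns (body_or_None, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return fetch(url), attempts
        except urllib.error.URLError as e:
            # HTTPError (a URLError subclass) has .code; plain URLError does not.
            code = getattr(e, "code", None)
            is_server_error = code is not None and 500 <= code < 600
            if not is_server_error or attempts > num_retries:
                return None, attempts


def always_500(url):
    # Hypothetical stand-in for a server that always fails with HTTP 500.
    raise urllib.error.HTTPError(url, 500, "Internal Server Error", None, None)


def always_404(url):
    # Hypothetical stand-in for a missing page.
    raise urllib.error.HTTPError(url, 404, "Not Found", None, None)


print(download_with_retries(always_500, "http://example.com/"))
print(download_with_retries(always_404, "http://example.com/"))
```

The 500 case makes three attempts before giving up, while the 404 case stops after the first, since repeating a bad request will not fix it.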
Version 3
# Detect the page's encoding automatically and decode with it
import urllib.error
import urllib.request

import chardet


def download(url, num_retries=2):
    print('downloading: ', url)
    try:
        html = urllib.request.urlopen(url).read()
    except urllib.error.URLError as e:
        print('download error: ', e.reason)
        html = None
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            return download(url, num_retries - 1)
    return html


page = download('http://www.sdu.edu.cn')
if page is not None:
    charset_info = chardet.detect(page)  # detect the text encoding
    print(charset_info)
    # Fall back to utf-8 if chardet cannot determine an encoding.
    print(page.decode(charset_info['encoding'] or 'utf-8', 'ignore'))
else:
    print('Receive None')

This version adds automatic detection of the page's encoding.
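It uses the detect function from the chardet module to detect the page's encoding
and decodes the page with that encoding.
Note that detection can take a long time when a page is large.
In that case it is enough to run detection on only part of the content:
chardet.detect(page[:500])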
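
The same "look only at the start of the page" idea also works without chardet when the page declares its own encoding. The sketch below is a standard-library alternative, not the chardet approach used in this post: it searches the first bytes for a charset declaration in a meta tag. The function name sniff_declared_charset, the 1024-byte limit and the utf-8 default are assumptions for illustration.

```python
import re

# Matches charset=... inside a <meta> tag, e.g. <meta charset="gb2312"> or
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
_META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def sniff_declared_charset(page, limit=1024, default="utf-8"):
    """Return the charset declared in the first `limit` bytes, or `default`."""
    match = _META_CHARSET.search(page[:limit])
    if match:
        return match.group(1).decode("ascii")
    return default


html = '<html><head><meta charset="gb2312"><title>测试</title></head></html>'
page = html.encode("gb2312")
charset = sniff_declared_charset(page)
print(charset)
print(page.decode(charset, "ignore"))
```

A declared charset can be missing or wrong, so statistical detection such as chardet remains useful as a fallback.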
Version 4
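# Send a custom User-Agent header with the request
# Detect the page's encoding automatically and decode with it
import urllib.error
import urllib.request

import chardet


def download(url, user_agent='wswp', num_retries=2):
    print('downloading: ', url)
    headers = {'User-agent': user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        response = urllib.request.urlopen(request)
        html = response.read()
    except urllib.error.URLError as e:
        print('download error: ', e.reason)
        html = None
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            # Pass user_agent along so retries send the same header.
            return download(url, user_agent, num_retries - 1)
    return html


def decode_page(page):
    if page is not None:
        charset_info = chardet.detect(page[:500])  # detect the text encoding
        # Fall back to utf-8 if chardet cannot determine an encoding.
        charset = charset_info['encoding'] or 'utf-8'
        return page.decode(charset, 'ignore')
    else:
        return 'None Page'


page = download('http://www.nju.edu.cn')
txt = decode_page(page)
print(txt)

The improvement this time is a user agent: the request now carries a User-Agent header.
Many sites restrict crawlers, so a crawler often has to pass itself off as a browser, usually by sending a Mozilla-style User-Agent string.
This version is only a demonstration, of course.
This time the site being fetched is Nanjing University's.
Later versions are covered in the next blog post.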
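
As a small offline check of the header used in version 4, the sketch below builds a Request object and reads the header back without sending anything. The helper name build_request and the 'Mozilla/5.0' string are illustrative assumptions.

```python
import urllib.request


def build_request(url, user_agent="wswp"):
    """Return a Request carrying a custom User-Agent header (nothing is sent)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.nju.edu.cn", "Mozilla/5.0")
# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

Because urllib normalizes the header name, 'User-agent' and 'User-Agent' in the headers dictionary end up as the same header.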