Build a crawler step by step: scraping jokes from Qiushibaike.
For now we will not use the BeautifulSoup package for parsing.
Step 1: request the URL and fetch the page source.
# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:   2016-12-22 16:16:08
# @Last Modified by:   HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()

    print content.decode('utf-8')
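The script above targets Python 2 (urllib2 and the print statement). As a hedged aside that is not part of the original tutorial, roughly the same fetch step under Python 3 might look like the sketch below, using urllib.request and urllib.error from the standard library:

# -*- coding: utf-8 -*-
# Hedged sketch: a Python 3 version of the fetch step (not from the original tutorial).
import urllib.request
import urllib.error

if __name__ == '__main__':
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36')
    headers = {'User-Agent': user_agent}
    try:
        request = urllib.request.Request(url=url, headers=headers)
        response = urllib.request.urlopen(request)
        content = response.read()          # raw bytes
    except urllib.error.HTTPError as e:
        print(e)
        raise SystemExit
    except urllib.error.URLError as e:
        print(e)
        raise SystemExit

    print(content.decode('utf-8'))         # decode before printing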
Step 2: extract the information with regular expressions.
First, inspect the page source to find where the content you need sits and how to identify it.
Then write a regular expression to match and extract it.
Note that . in a regular expression does not match \n by default, so you need to set the matching mode (re.S) accordingly; a small illustration follows.
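To see why the flag matters, here is a tiny self-contained illustration (the HTML snippet is invented for the demo): without re.S the pattern cannot cross the line break, with re.S it matches.

# -*- coding: utf-8 -*-
# Minimal illustration of re.S; the HTML snippet below is made up for the demo.
import re

html = '<div class="content">\n<span>first joke</span>\n</div>'
pattern = '<div class="content">.*?<span>(.*?)</span>.*?</div>'

print(re.findall(pattern, html))          # [] -- plain . stops at \n
print(re.findall(pattern, html, re.S))    # ['first joke'] -- re.S lets . match \n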
# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:   2016-12-22 16:16:08
# @Last Modified by:   HaonanWu
# @Last Modified time: 2016-12-22 20:17:13
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()

    # Extract the data; re.S lets . match newlines as well
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
    items = re.findall(regex, content)
    for item in items:
        print item
Step 3: clean up the data and save it to files.
# -*- coding: utf-8 -*-
# @Author: HaonanWu
# @Date:   2016-12-22 16:16:08
# @Last Modified by:   HaonanWu
# @Last Modified time: 2016-12-22 21:41:32
import urllib
import urllib2
import re
import os

if __name__ == '__main__':
    # Request the URL and fetch the page source
    url = 'http://www.qiushibaike.com/textnew/page/1/?s=4941357'
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'
    headers = {'User-Agent': user_agent}
    try:
        request = urllib2.Request(url=url, headers=headers)
        response = urllib2.urlopen(request)
        content = response.read()
    except urllib2.HTTPError as e:
        print e
        exit()
    except urllib2.URLError as e:
        print e
        exit()

    # Extract the data; re.S lets . match newlines as well
    regex = re.compile('<div class="content">.*?<span>(.*?)</span>.*?</div>', re.S)
    items = re.findall(regex, content)

    path = './qiubai'
    if not os.path.exists(path):
        os.makedirs(path)

    count = 1
    for item in items:
        # Clean the data: strip the newlines and turn <br/> into \n
        item = item.replace('\n', '').replace('<br/>', '\n')
        filepath = path + '/' + str(count) + '.txt'
        f = open(filepath, 'w')
        f.write(item)
        f.close()
        count += 1
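Since the opening notes that BeautifulSoup is deliberately not used yet, here is a hedged sketch (not from the original tutorial) of how the extraction step could look with it, assuming BeautifulSoup (bs4) is installed and that content holds the page source fetched as in the scripts above; the selector simply mirrors the div/span structure targeted by the regex.

# -*- coding: utf-8 -*-
# Hedged sketch: the extraction step rewritten with BeautifulSoup (bs4), assuming
# it is installed and `content` is the page source fetched as in the scripts above.
from bs4 import BeautifulSoup

def extract_items(content):
    soup = BeautifulSoup(content, 'html.parser')
    items = []
    # Mirror the regex: the joke text sits in a <span> inside <div class="content">
    for div in soup.find_all('div', class_='content'):
        span = div.find('span')
        if span is not None:
            # get_text() drops the remaining tags such as <br/>
            items.append(span.get_text())
    return items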