A Python Script for Scraping Discuz! Usernames

2020-02-23 05:02:59


Source: reposted

Contributed by: a site user

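I have been learning Python recently, so I used it to write a script that scrapes Discuz! usernames. The code is short but rather crude. The idea is simple: match each profile page's title with a regular expression, extract the username, and append it to a text file. The program uses the Baidu Webmaster Community as its example (it has more than 400,000 users in total). I left it running on a VPS and stopped watching it; although it waits between requests, I later found that it had collected only a little over 50,000 usernames before it was blocked...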
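Before the full script, the title-matching idea can be tried offline. The snippet below is a minimal sketch in Python 3: the sample page fragment, the username `alice`, and the helper name `extract_username` are made up for illustration, and the real forum markup may differ.

```python
import re

# Hypothetical page fragment; the real markup on the forum may differ.
SAMPLE_PAGE = "<html><head><title>alice的個人資料百度站長社區 </title></head></html>"

# Non-greedy group, anchored on the fixed title suffix used by the script's pattern.
TITLE_PATTERN = re.compile(r"<title>(.*?)的個人資料百度站長社區 </title>")


def extract_username(page):
    """Return the username found in the page title, or None if the title does not match."""
    match = TITLE_PATTERN.search(page)
    return match.group(1) if match else None


print(extract_username(SAMPLE_PAGE))           # alice
print(extract_username("<title>404</title>"))  # None
```

Keeping the extraction in its own function makes it possible to check the pattern against saved pages without sending any requests.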
The code is as follows:
# -*- coding: utf-8 -*-
# Author: 天一
# Blog: http://www.90blog.org
# Version: 1.0
# Purpose: Python script that scrapes usernames from the Baidu Webmaster Platform

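import re
import time
import urllib.error
import urllib.request

def BiduSpider():
    # Non-greedy group: capture only the username before the fixed title suffix
    pattern = re.compile(r'<title>(.*?)的個人資料百度站長社區 </title>')
    uid = 1
    while uid < 400000:
        theUrl = "http://bbs.zhanzhang.baidu.com/home.php?mod=space&uid=" + str(uid)
        uid += 1
        try:
            theResponse = urllib.request.urlopen(theUrl)
            # Decode the page as UTF-8 so the pattern can be matched against text
            thePage = theResponse.read().decode('utf-8', 'ignore')
        except urllib.error.URLError:
            # Skip a UID whose page cannot be fetched instead of aborting the run
            time.sleep(0.5)
            continue
        # Match the username with the regular expression
        theFindall = pattern.findall(thePage)
        # Wait 0.5 seconds so that frequent requests do not get blocked
        time.sleep(0.5)
        if theFindall:
            # Append to the text file as GBK to avoid garbled Chinese output;
            # characters GBK cannot represent are replaced rather than raising
            with open('theUid.txt', 'a', encoding='gbk', errors='replace') as f:
                f.write(theFindall[0] + '\n')

if __name__ == '__main__':
    BiduSpider()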
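
The fixed 0.5 second delay did not keep the script from being blocked. One possible refinement, which is not part of the original script, is to add random jitter to the delay, wait longer after failures, and stop once several requests in a row fail. The sketch below is in Python 3. The names `polite_delay` and `crawl` and every default value are assumptions chosen for illustration, and `fetch` stands for any function that downloads one profile page. Nothing here guarantees that a site will not block the client.

```python
import random
import time


def polite_delay(base=0.5, jitter=0.5, failures=0, cap=60.0):
    """Return seconds to wait: base plus random jitter, doubled per recent failure, capped."""
    delay = (base + random.uniform(0, jitter)) * (2 ** failures)
    return min(delay, cap)


def crawl(uids, fetch, sleep=time.sleep, max_failures=5):
    """Call fetch(uid) for each uid and pause between requests.

    A failed uid is skipped, not retried. The loop stops after max_failures
    consecutive errors, which is treated here as a sign of being blocked.
    """
    results = {}
    failures = 0
    for uid in uids:
        try:
            results[uid] = fetch(uid)
            failures = 0
        except OSError:  # urllib.error.URLError is a subclass of OSError
            failures += 1
            if failures >= max_failures:
                break
        sleep(polite_delay(failures=failures))
    return results
```

Passing `sleep` as a parameter lets the loop be exercised with a stand-in function, so the stopping logic can be checked without waiting or touching the network.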