Extracting Web Page Information with BeautifulSoup and Saving It Automatically

2019-11-14 12:08:16

Source: reprinted

Contributed by: a site user

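The instance methods and attributes of the BeautifulSoup class are not repeated here. Instead, let us work through an example to see how BeautifulSoup can extract information from a website and store it automatically.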

The example below uses the domain name of the supplied site as the folder name and saves the image files it extracts into that folder.

from bs4 import BeautifulSoup
import requests
import os
from urllib.request import urlopen
from urllib.parse import urlparse, urljoin

# To require the URL on the command line instead, import sys and add:
# if len(sys.argv) < 2:
#     print("Usage: python bs4FileTest.py URL")
#     exit(1)

url = 'http://www.abvedu.com/appcpzs'

# Fetch the page; resp is a requests.Response object
resp = requests.get(url)
print(type(resp))
resp.encoding = 'gbk'
# Get the decoded HTML text of the page
html = resp.text
# Parse the HTML and create a BeautifulSoup instance named bs
bs = BeautifulSoup(html, 'html.parser')
# Search for <a> and <img> tags and collect the matches in the list all_imgs
all_imgs = bs.find_all(['a', 'img'])
# all_imgs = bs.find_all(['img'])

# Use the site's host name (www.abvedu.com) as the folder name,
# and create the folder if it does not exist yet
image_dir = urlparse(url).hostname
os.makedirs(image_dir, exist_ok=True)

# Iterate over the matched tags
for link in all_imgs:
    # Read the src attribute (<img>) and the href attribute (<a>); either may be None
    src = link.get('src')
    print("-----", src, "------------")
    href = link.get('href')
    print("**********", href, "**********")
    for t in (src, href):
        if t is None:
            continue
        # Resolve relative links against the page URL
        full_path = urljoin(url, t)
        filename = os.path.basename(urlparse(full_path).path)
        ext = os.path.splitext(filename)[1].lower()
        # Keep only links that point to image files
        if ext not in ('.jpg', '.png', '.gif'):
            continue
        print(full_path)
        # Download the image and write it into the folder in binary mode
        with urlopen(full_path) as image, open(os.path.join(image_dir, filename), 'wb') as fp:
            fp.write(image.read())
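
The example turns each link into a full address on the site and takes the file name from the end of that address. A minimal sketch of those two steps in isolation, using only the standard library, might look like the following. The helper names resolve_image_url and local_filename, the IMAGE_EXTENSIONS tuple, and the sample links are assumptions made for illustration; they do not appear in the original script.

```python
import os
from urllib.parse import urljoin, urlparse

# Hypothetical names for illustration; not part of the original script.
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif')


def resolve_image_url(page_url, link):
    """Return an absolute URL if link points to an image file, else None."""
    if not link:
        return None
    # urljoin handles absolute links, root-relative links and relative links
    absolute = urljoin(page_url, link)
    # Inspect only the path so a query string cannot hide the extension
    ext = os.path.splitext(urlparse(absolute).path)[1].lower()
    return absolute if ext in IMAGE_EXTENSIONS else None


def local_filename(image_url):
    """Use the last path segment of the URL as the local file name."""
    return os.path.basename(urlparse(image_url).path)


page = 'http://www.abvedu.com/appcpzs'
# Made-up sample links covering the common cases
samples = ['/images/logo.png', 'pic/a.jpg', 'http://example.com/b.gif', 'index.html', None]
for link in samples:
    full = resolve_image_url(page, link)
    if full:
        print(full, '->', local_filename(full))
```

Checking the extension on the parsed path, rather than searching the whole link for a substring such as 'gif', avoids matching ordinary pages whose address merely contains those letters.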
