Python Web Scraping Basics

Fetching data from a site
This requires the requests library.
The requests library
Installation: pip install requests
Usage
```python
import requests

# Send a browser-like User-Agent so the site does not reject the request
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
# `start` is the paging offset, defined by the loop in the full example below
response = requests.get(f"https://movie.douban.com/top250?start={start}&filter=", headers=headers)
html = response.text
```
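The Top 250 list is paginated 25 movies per page, and the start parameter above selects the page. As a quick sketch, these are the URLs the full example will loop over:

```python
# The list shows 25 movies per page; `start` is the offset of the first
# movie on each page, so ten requests cover all 250 entries.
urls = [f"https://movie.douban.com/top250?start={start}&filter=" for start in range(0, 250, 25)]
print(len(urls))  # 10
print(urls[0])    # https://movie.douban.com/top250?start=0&filter=
```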
Processing the data retrieved from the site
This requires BeautifulSoup, from the bs4 library.
The bs4 library
Installation: pip install beautifulsoup4
Usage
Import
```python
from bs4 import BeautifulSoup
```
BeautifulSoup parses the html string obtained in the previous step into a tree structure, which makes the later operations convenient.
```python
soup = BeautifulSoup(html, "html.parser")
```
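The same parsing can be tried offline on a small inline snippet; the HTML below is made up to imitate the structure of one movie entry (the class names are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating one movie entry on the real page
html = ('<div class="hd">'
        '<span class="title">肖申克的救赎</span>'
        '<span class="title">\xa0/\xa0The Shawshank Redemption</span>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")
titles = [str(span.string) for span in soup.findAll("span", attrs={"class": "title"})]
print(titles)  # two titles: the Chinese one, then a "/"-prefixed English one
```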
After that, the findAll function can be used to look up specific content (in current versions of bs4, find_all is the preferred spelling of the same method).
```python
all_names = soup.findAll("span", attrs={"class": "title"})
for name in all_names:
    # Titles containing "/" are the foreign-language ones; keep only Chinese titles
    if "/" not in str(name.string):
        id = id + 1
        f.write(str(id) + ": " + str(name.string) + '\n')
```
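Each movie yields two span.title elements, the Chinese title plus a "/"-prefixed foreign title, which is why the loop above keeps only strings without "/". A small sketch of that filter with made-up titles:

```python
# Made-up data: each movie contributes its Chinese title and a "/"-prefixed foreign one
titles = ["肖申克的救赎", "\xa0/\xa0The Shawshank Redemption",
          "霸王别姬", "\xa0/\xa0Farewell My Concubine"]
chinese = [t for t in titles if "/" not in t]
print(chinese)  # ['肖申克的救赎', '霸王别姬']
```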
Example
Fetch the Chinese titles of the Douban Top 250 movies
```python
import requests
from bs4 import BeautifulSoup

with open("douban.txt", "w", encoding="utf-8") as f:
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36"}
    id = 0
    # Each page shows 25 movies; `start` is the offset of the first movie on the page
    for start in range(0, 250, 25):
        response = requests.get(f"https://movie.douban.com/top250?start={start}&filter=", headers=headers)
        html = response.text
        soup = BeautifulSoup(html, "html.parser")
        all_names = soup.findAll("span", attrs={"class": "title"})
        for name in all_names:
            # Skip the foreign-language titles, which contain "/"
            if "/" not in str(name.string):
                id = id + 1
                f.write(str(id) + ": " + str(name.string) + '\n')
```
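The manual id counter in the example can also be written with enumerate, which numbers the items automatically. A minimal sketch over made-up titles:

```python
# Hypothetical list standing in for the titles collected from all pages
names = ["肖申克的救赎", "霸王别姬", "阿甘正传"]
lines = [f"{i}: {name}" for i, name in enumerate(names, start=1)]
print("\n".join(lines))
# 1: 肖申克的救赎
# 2: 霸王别姬
# 3: 阿甘正传
```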