Python Data Analysis
Published: 2019-05-25


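Author: 挖数
Link: https://www.zhihu.com/question/20899988/answer/96904827
Source: Zhihu
Copyright belongs to the author. To repost, please contact the author for permission.
What follows is my monster-slaying, level-up journey through learning Python web scraping. It was full of hardship and full of fun. I haven't beaten the final boss yet, but the scenery along the way is the best part, isn't it? I hope everyone gets what they came for!
Warning: lots of images ahead!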
<img src="https://pic4.zhimg.com/55e8bc9324234bc88b354821ce005bc3_b.png" data-rawwidth="288" data-rawheight="179" class="content_image" width="288">
<img src="https://pic3.zhimg.com/af1baba1052c2cd49cea5ea6986eb30a_b.png" data-rawwidth="242" data-rawheight="268" class="content_image" width="242">
<img src="https://pic2.zhimg.com/5ec82828ba71e96a7d86b7e88254ccd9_b.png" data-rawwidth="254" data-rawheight="230" class="content_image" width="254">
<img src="https://pic3.zhimg.com/c60bde3fec9e5f791b1a217613879b46_b.png" data-rawwidth="278" data-rawheight="320" class="content_image" width="278">
<img src="https://pic3.zhimg.com/974b3d7c1c50bac62c14afe58ff0ed26_b.png" data-rawwidth="309" data-rawheight="318" class="content_image" width="309">
<img src="https://pic2.zhimg.com/2c3e1e5f18d6e6cc8758337663c548f5_b.png" data-rawwidth="313" data-rawheight="264" class="content_image" width="313">
<img src="https://pic4.zhimg.com/b65ad1e407e0335107eca80e4a0bdac3_b.png" data-rawwidth="266" data-rawheight="240" class="content_image" width="266">
<img src="https://pic2.zhimg.com/70067cc590378e31676ed48192633d7d_b.png" data-rawwidth="269" data-rawheight="246" class="content_image" width="269">
<img src="https://pic4.zhimg.com/2cecf7ef8b19f24a2fb287403a51142b_b.png" data-rawwidth="299" data-rawheight="254" class="content_image" width="299">
<img src="https://pic3.zhimg.com/b2867a2ddb861a04a91fde5d34ed5982_b.png" data-rawwidth="212" data-rawheight="266" class="content_image" width="212">
<img src="https://pic3.zhimg.com/ae5a6594ab77bfdeaaa9e45b9420c93e_b.png" data-rawwidth="313" data-rawheight="266" class="content_image" width="313">
<img src="https://pic4.zhimg.com/5f65be4b49e5f84ab99efc92ab6ea61b_b.png" data-rawwidth="304" data-rawheight="232" class="content_image" width="304">
<img src="https://pic2.zhimg.com/506899fbbe618e05cbe1e2768665b17d_b.png" data-rawwidth="287" data-rawheight="234" class="content_image" width="287">
<img src="https://pic1.zhimg.com/009fcaa5d4a08f4eda54fb38b88e575c_b.png" data-rawwidth="325" data-rawheight="354" class="content_image" width="325">
<img src="https://pic3.zhimg.com/b93fbe0719c946b1a68a3f0b33937942_b.png" data-rawwidth="289" data-rawheight="243" class="content_image" width="289">
<img src="https://pic2.zhimg.com/ded59bb8038a10b3bfb4e65fd14db631_b.png" data-rawwidth="309" data-rawheight="189" class="content_image" width="309">
<img src="https://pic2.zhimg.com/8d8337c43a58a5386227e037891f9d61_b.png" data-rawwidth="266" data-rawheight="346" class="content_image" width="266">
<img src="https://pic2.zhimg.com/e5dbb6f838f6532b0d0a481c69a79ddd_b.png" data-rawwidth="338" data-rawheight="269" class="content_image" width="338">
<img src="https://pic4.zhimg.com/5e1b525feb212ff0b860481ecb67288b_b.png" data-rawwidth="255" data-rawheight="175" class="content_image" width="255">
Here is a piece of code that scrapes Zhihu avatars:
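import random
import re
import urllib.request
from time import sleep

import requests


def main():
    url = ' '  # topic URL is left blank in the original
    # This topic seems to have a lot of pretty girls
    headers = {}  # header contents are omitted in the original
    i = 1
    for x in range(20, 3600, 20):
        data = {'start': '0',
                'offset': str(x),
                '_xsrf': 'a128464ef225a69348cef94c38f4e428'}
        # Zhihu uses offset to control paging; each response loads 20 items
        content = requests.post(url, headers=headers, data=data, timeout=10).text
        # The form data is submitted with POST
        imgs = re.findall(r'<img src=\\"(.*?)_m\.jpg', content)
        # Extract image URLs from the returned JSON with a regex; dropping _m gives the large image
        for img in imgs:
            try:
                img = img.replace('\\', '')
                # Strip the backslashes left over from JSON escaping
                pic = img + '.jpg'
                path = 'd:\\bs4\\zhihu\\jpg\\' + str(i) + '.jpg'
                # Target directory and file name for the image
                urllib.request.urlretrieve(pic, path)
                # Download the image
                print('Downloaded image #' + str(i))
                i += 1
                sleep(random.uniform(0.5, 1))
                # Sleep so we don't crawl too fast and get the IP banned
            except Exception:
                print('Missed 1 image')
        sleep(random.uniform(0.5, 1))


if __name__ == '__main__':
    main()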
Result:
<img src="https://pic2.zhimg.com/b1fc67ee3e290376fe882113ff7d44fd_b.png" data-rawwidth="710" data-rawheight="744" class="origin_image zh-lightbox-thumb" width="710" data-original="https://pic2.zhimg.com/b1fc67ee3e290376fe882113ff7d44fd_r.png">
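The two core ideas in the script, paging by offset and pulling image URLs out of JSON-escaped HTML, can be tried offline without touching the network. The sketch below is only an illustration: the sample payload, the example.com hosts, and the helper names `extract_large_image_urls` and `page_offsets` are all made up for this purpose, and the real response format may well differ.

```python
import re

# Hypothetical sample of a JSON response whose HTML has escaped quotes and slashes.
# The hosts and file names are invented for illustration only.
SAMPLE = (
    r'{"msg": ["<img src=\"https:\/\/pic1.example.com\/abc123_m.jpg\" class=\"avatar\">", '
    r'"<img src=\"https:\/\/pic2.example.com\/def456_m.jpg\">"]}'
)

# Matches a literal backslash-quote after src= and captures everything before "_m.jpg".
IMG_PATTERN = re.compile(r'<img src=\\"(.*?)_m\.jpg')


def extract_large_image_urls(content):
    """Return image URLs with the _m (medium) suffix dropped and JSON backslashes removed."""
    urls = []
    for stem in IMG_PATTERN.findall(content):
        urls.append(stem.replace('\\', '') + '.jpg')
    return urls


def page_offsets(first=20, stop=3600, step=20):
    """Return the offset values the script would post, one per page of 20 items."""
    return list(range(first, stop, step))


if __name__ == '__main__':
    print(extract_large_image_urls(SAMPLE))
    print(len(page_offsets()), 'pages')
```

Keeping the extraction in a small function like this makes it easy to check the regex against a saved response before running the full download loop.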
Finally, please follow me. I promise to take good care of your timeline.
\( ^▽^ )/