python爬虫开发之urllib模块详细使用方法与实例全解

脚本专栏 2026/3/30 佚名

3 2 1

爬虫所需要的功能，基本上在urllib中都能找到，学习这个标准库，可以更加深入的理解后面更加便利的requests库。

首先

在Pytho2.x中使用import urllib2——-对应的，在Python3.x中会使用import urllib.request，urllib.error

在Pytho2.x中使用import urllib——-对应的，在Python3.x中会使用import urllib.request，urllib.error，urllib.parse

在Pytho2.x中使用import urlparse——-对应的，在Python3.x中会使用import urllib.parse

在Pytho2.x中使用import urlopen——-对应的，在Python3.x中会使用import urllib.request.urlopen

在Pytho2.x中使用import urlencode——-对应的，在Python3.x中会使用import urllib.parse.urlencode

在Pytho2.x中使用import urllib.quote——-对应的，在Python3.x中会使用import urllib.request.quote

在Pytho2.x中使用cookielib.CookieJar——-对应的，在Python3.x中会使用http.CookieJar

在Pytho2.x中使用urllib2.Request——-对应的，在Python3.x中会使用urllib.request.Request

urllib是Python自带的标准库，无需安装，直接可以用。"htmlcode">

import urllib
r = urllib.urlopen('//www.jb51.net/')
datatLine = r.readline() #读取html页面的第一行
data=file.read() #读取全部
f=open("./1.html","wb") # 网页保存在本地
f.write(data)
f.close()

urlopen返回对象提供方法：

read() , readline() ,readlines() , fileno() , close() ：这些方法的使用方式与文件对象完全一样 info()：返回一个httplib.HTTPMessage对象，表示远程服务器返回的头信息 getcode()：返回Http状态码。如果是http请求，200请求成功完成;404网址未找到 geturl()：返回请求的url

urllib.quote(url)和urllib.quote_plus(url)，对关键字进行编码可使得urlopen能够识别

POST请求

import urllib.request
import urllib.parse
url = 'https://passport.jb51.net/user/signin"htmlcode">

import urllib.request
import urllib.parse
url = 'https://passport.jb51.net/user/signin"htmlcode">

def use_proxy(proxy_addr,url):
import urllib.request
#构建代理
proxy=urllib.request.ProxyHandler({'http':proxy_addr})
# 构建opener对象
opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
# 安装到全局
# urllib.request.install_opener(opener)
# data=urllib.request.urlopen(url).read().decode('utf8') 以全局方式打开
data=opener.open(url) # 直接用句柄方式打开
return data
proxy_addr='61.163.39.70:9999'
data=use_proxy(proxy_addr,'//www.jb51.net')
print(len(data))
## 异常处理以及日输出


opener通常是build_opener()创建的opener对象。
install_opener(opener) 安装opener作为urlopen()使用的全局URL opener
cookie的使用
获取Cookie保存到变量


import http.cookiejar, urllib.request
#使用http.cookiejar.CookieJar()创建CookieJar对象
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
#使用HTTPCookieProcessor创建cookie处理器，并以其为参数构建opener对象
opener = urllib.request.build_opener(handler)
#将opener安装为全局
urllib.request.install_opener(opener)
response = urllib.request.urlopen('//www.jb51.net')
#response = opener.open('//www.jb51.net')
for item in cookie:
print 'Name = '+item.name
print 'Value = '+item.value


首先我们必须声明一个 CookieJar 对象，接下来我们就需要利用 HTTPCookieProcessor 来构建一个 handler ，最后利用 build_opener 方法构建出 opener ，执行 open() 即可。 最后循环输出cookiejar
获取Cookie保存到本地


import cookielib
import urllib
#设置保存cookie的文件，同级目录下的cookie.txt
filename = 'cookie.txt'
#声明一个MozillaCookieJar对象实例来保存cookie，之后写入文件
cookie = cookielib.MozillaCookieJar(filename)
#利用urllib库的HTTPCookieProcessor对象来创建cookie处理器
handler = urllib.request.HTTPCookieProcessor(cookie)
#通过handler来构建opener
opener = urllib.request.build_opener(handler)
#创建一个请求，原理同urllib2的urlopen
response = opener.open("//www.jb51.net")
#保存cookie到文件
cookie.save(ignore_discard=True, ignore_expires=True)


异常处理
异常处理结构如下


try:
# 要执行的代码
print(...)
except:
#try代码块里的代码如果抛出异常了，该执行什么内容
print(...)
else:
#try代码块里的代码如果没有跑出异常，就执行这里
print(...)
finally:
#不管如何，finally里的代码，是总会执行的
print(...)


URLerror产生原因：
1、网络未连接（即不能上网）


from urllib import request, error
try:
r=request.urlopen('//www.jb51.net')
except error.URLError as e:
print(e.reason)


2、访问页面不存(HTTPError)
客户端向服务器发送请求，如果成功地获得请求的资源，则返回的状态码为200，表示响应成功。如果请求的资源不存在， 则通常返回404错误。


from urllib imort request, error
try:
response = request.urlopen('//www.jb51.net')
except error.HTTPError as e:
print(e.reason, e.code, e.headers, sep='\n')
else:
print("Request Successfully')
# 加入 hasattr属性提前对属性,进行判断原因
from urllib import request,error
try:
response=request.urlopen('http://blog.jb51.net')
except error.HTTPError as e:
if hasattr(e,'code'):
print('the server couldn\'t fulfill the request')
print('Error code:',e.code)
elif hasattr(e,'reason'):
print('we failed to reach a server')
print('Reason:',e.reason)
else:
print('no exception was raised')
# everything is ok


下面为大家列出几个urllib模块很有代表性的实例
1、引入urllib模块


import urllib.request
response = urllib.request.urlopen('http://jb51.net/')
html = response.read()


2、使用 Request


import urllib.request
req = urllib.request.Request('http://jb51.net/')
response = urllib.request.urlopen(req)
the_page = response.read()


3、发送数据


#! /usr/bin/env python3
import urllib.parse
import urllib.request
url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
'act' : 'login',
'login[email]' : 'yzhang@i9i8.com',
'login[password]' : '123456'
}
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data)
req.add_header('Referer', '//www.jb51.net/')
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))


4、发送数据和header


#! /usr/bin/env python3
import urllib.parse
import urllib.request
url = 'http://localhost/login.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {
'act' : 'login',
'login[email]' : 'yzhang@i9i8.com',
'login[password]' : '123456'
}
headers = { 'User-Agent' : user_agent }
data = urllib.parse.urlencode(values)
req = urllib.request.Request(url, data, headers)
response = urllib.request.urlopen(req)
the_page = response.read()
print(the_page.decode("utf8"))


5、http 错误


#! /usr/bin/env python3
import urllib.request
req = urllib.request.Request('//www.jb51.net ')
try:
urllib.request.urlopen(req)
except urllib.error.HTTPError as e:
print(e.code)
print(e.read().decode("utf8"))


6、异常处理


#! /usr/bin/env python3
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request("//www.jb51.net /")
try:
response = urlopen(req)
except HTTPError as e:
print('The server couldn't fulfill the request.')
print('Error code: ', e.code)
except URLError as e:
print('We failed to reach a server.')
print('Reason: ', e.reason)
else:
print("good!")
print(response.read().decode("utf8"))


7、异常处理


from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request("//www.jb51.net /")
try:
response = urlopen(req)
except URLError as e:
if hasattr(e, 'reason'):
print('We failed to reach a server.')
print('Reason: ', e.reason)
elif hasattr(e, 'code'):
print('The server couldn't fulfill the request.')
print('Error code: ', e.code)
else:
print("good!")
print(response.read().decode("utf8"))


8、HTTP 认证


#! /usr/bin/env python3
import urllib.request
# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "https://www.jb51.net /"
password_mgr.add_password(None, top_level_url, 'rekfan', 'xxxxxx')
handler = urllib.request.HTTPBasicAuthHandler(password_mgr)
# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)
# use the opener to fetch a URL
a_url = "https://www.jb51.net /"
x = opener.open(a_url)
print(x.read())
# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)
a = urllib.request.urlopen(a_url).read().decode('utf8')
print(a)


9、使用代理


#! /usr/bin/env python3
import urllib.request
proxy_support = urllib.request.ProxyHandler({'sock5': 'localhost:1080'})
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
a = urllib.request.urlopen("//www.jb51.net ").read().decode("utf8")
print(a)


10、超时


#! /usr/bin/env python3
import socket
import urllib.request
# timeout in seconds
timeout = 2
socket.setdefaulttimeout(timeout)
# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('//www.jb51.net /')
a = urllib.request.urlopen(req).read()
print(a)


11.自己创建build_opener


header=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
#创建opener对象
opener=urllib.request.build_opener()
opener.addheaders=header
#设置opener对象作为urlopen()使用的全局opener
urllib.request.install_opener(opener)
response =urllib.request.urlopen('//www.jb51.net/')
buff = response.read()
html = buff .decode("utf8")
response.close()
print(the_page)


12.urlib.resquest.urlretrieve远程下载


header=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36')]
#创建opener对象
opener=urllib.request.build_opener()
opener.addheaders=header
#设置opener对象作为urlopen()使用的全局opener
urllib.request.install_opener(opener)
#下载文件到当前文件夹
urllib.request.urlretrieve('//www.jb51.net/','baidu.html')
#清除urlretrieve产生的缓存
urlib.resquest.urlcleanup()


13.post请求


import urllib.request
import urllib.parse
url='//www.jb51.net/mypost/'
#将数据使用urlencode编码处理后，使用encode()设置为utf-8编码
postdata=urllib.parse.urlencode({name:'测试名',pass:"123456"}).encode('utf-8')
#urllib.request.quote()接受字符串，
#urllib.parse.urlencode()接受字典或者列表中的二元组[(a,b),(c,d)],将URL中的键值对以连接符&划分
req=urllib.request.Request(url,postdata)
#urllib.request.Request(url, data=None, header={}, origin_req_host=None, unverifiable=False, #method=None)
#url：包含URL的字符串。
#data：http request中使用，如果指定，则发送POST而不是GET请求。
#header：是一个字典。
#后两个参数与第三方cookie有关。
req.add_header('user-agent','User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/
537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0')
data=urllib.request.urlopen(req).read()
//urlopen（）的data参数默认为None，当data参数不为空的时候，urlopen（）提交方式为Post。


14.cookie的使用
1.获取Cookie保存到变量


 import urllib.request
 import http.cookie
 # 声明一个CookieJar对象实例来保存cookie
 cookie = cookielib.CookieJar()
 # 利用urllib库的HTTPCookieProcessor对象来创建cookie处理器
 handler = urllib.request.HTTPCookieProcessor(cookie)
 # 通过handler来构建opener
 opener = urllib.request.build_opener(handler)
 # 此处的open方法同urllib.request的urlopen方法，也可以传入request
 urllib.request.install_opener(opener)
 #使用opener或者urlretrieve方法来获取需要的网站cookie
 urllib.request.urlretrieve('//www.jb51.net/','baidu.html')
 # data=urllib.request.urlopen('//www.jb51.net/')



2.保存cookies到文件


import http.cookie
import urllib.request
# 设置保存cookie的文件，同级目录下的cookie.txt
filename = 'cookie.txt'
# 声明一个MozillaCookieJar对象实例来保存cookie，之后写入文件
cookie = http.cookie.MozillaCookieJar(filename)
# 利用urllib库的HTTPCookieProcessor对象来创建cookie处理器
handler = urllib.request.HTTPCookieProcessor(cookie)
# 通过handler来构建opener
opener = urllib.request.build_opener(handler)
# 创建一个请求，原理同urllib的urlopen
response = opener.open("//www.jb51.net")
# 保存cookie到文件
cookie.save(ignore_discard=True, ignore_expires=True)



3.从文件中获取cookies并访问


import http.cookielib
import urllib.request
# 创建MozillaCookieJar实例对象
cookie = http.cookie.MozillaCookieJar()
# 从文件中读取cookie内容到变量
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
# 创建请求的request
req = urllib.Request("//www.jb51.net")
# 利用urllib的build_opener方法创建一个opener
opener = urllib.build_opener(urllib.request.HTTPCookieProcessor(cookie))
response = opener.open(req)
print (response.read())


15.代理服务器设置


import socket
#设置Socket连接超时时间,同时决定了urlopen的超时时间
socket.setdefaulttimeout(1)
import urllib.request
#代理服务器信息，http代理使用地址
startime = time.time()
#设置http和https代理
proxy=request.ProxyHandler({'https':'175.155.25.91:808','http':'175.155.25.91:808'})
opener=request.build_opener(proxy)
opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0'),
# ("Accept","text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
# ("Accept-Language", "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3"),
# ("Accept-Encoding","gzip, deflate, br"),
# ("Connection","keep-alive"),
# ("Pragma","no-cache"),
# ("Cache-Control","no-cache")
  ]
request.install_opener(opener)
# data = request.urlopen('https://www.jb51.net/find-ip-address').read()
data = request.urlopen( 'http://www.ipip.net/' ).read().decode('utf-8')
# data=gzip.decompress(data).decode('utf-8','ignore')
endtime = time.time()
delay = endtime-startime
print(data)


有时在urlopen的data数据直接decode(‘utf-8')会失败，必须要使用gzip.decompress(‘utf-8','ignore')才能打开，猜测应该是header的问题，换一个有时会好
本文主要讲解了python爬虫模块urllib详细使用方法与实例全解，更多关于python爬虫模块urllib详细使用方法与实例请查看下面的相关链接

python爬虫开发,python爬虫urllib模块详细使用方法,python爬虫urllib模块实例全解

标签：

python爬虫开发,python爬虫urllib模块详细使用方法,python爬虫urllib模块实例全解

免责声明：本站文章均来自网站采集或用户投稿，网站不提供任何软件下载或自行开发的软件！如有用户或公司发现本站内容信息存在侵权行为，请邮件告知！ 858582#qq.com

白云城资源网 Copyright www.dyhadc.com

评论“python爬虫开发之urllib模块详细使用方法与实例全解”

python爬虫开发之urllib模块详细使用方法与实例全解

暂无“python爬虫开发之urllib模块详细使用方法与实例全解”评论...

www.dyhadc.com 白云城资源网

129,905影音资源

244,626技术资源

111,817软件资源

578,645站长资源

更新日志

2026年03月30日

python爬虫开发之urllib模块详细使用方法与实例全解

POST请求

cookie的使用

异常处理

1、引入urllib模块

2、使用 Request

3、发送数据

4、发送数据和header

5、http 错误

6、异常处理

7、异常处理

8、HTTP 认证

9、使用代理

10、超时

11.自己创建build_opener

12.urlib.resquest.urlretrieve远程下载

13.post请求

14.cookie的使用

15.代理服务器设置

python爬虫开发,python爬虫urllib模块详细使用方法,python爬虫urllib模块实例全解

Python unittest 自动识别并执行测试用例方式

在Python IDLE 下调用anaconda中的库教程

评论“python爬虫开发之urllib模块详细使用方法与实例全解”

更新日志

友情链接