Introduction
I wrote this little crawler to scrape internship postings from our campus forum; it is built mainly on the Requests library.
Source code
URLs.py
Given an initial URL that contains a page-number query parameter, this module builds the list of URLs from the current page number up to pageNum.
import re

def getURLs(url, attr, pageNum=1):
    """Expand the `attr` page parameter in `url` from its current value up to pageNum."""
    all_links = []
    try:
        now_page_number = int(re.search(attr + r'=(\d+)', url).group(1))
        for i in range(now_page_number, pageNum + 1):
            new_url = re.sub(attr + r'=\d+', attr + '=%s' % i, url)
            all_links.append(new_url)
        return all_links
    except TypeError:
        print "arguments TypeError: attr should be a string."
        return all_links
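To see what the function produces, here is a standalone, version-agnostic sketch of the same parameter expansion (the example URL and boardid value are made up for illustration):

```python
import re

def expand_pages(url, attr, page_num):
    # Same logic as getURLs above: read the current value of the
    # `attr` query parameter, then substitute every page number
    # from that value up to page_num.
    now = int(re.search(attr + r'=(\d+)', url).group(1))
    return [re.sub(attr + r'=\d+', '%s=%d' % (attr, i), url)
            for i in range(now, page_num + 1)]

print(expand_pages('http://example.org/list.asp?boardid=1&page=3', 'page', 5))
# → ['http://example.org/list.asp?boardid=1&page=3',
#    'http://example.org/list.asp?boardid=1&page=4',
#    'http://example.org/list.asp?boardid=1&page=5']
```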
uni_2_native.py
The Chinese text on the scraped pages comes back as HTML numeric character references of the form &#XXXX;, so the page content has to be converted back to native characters after fetching.
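The conversion itself fits in one line; here is a minimal standalone Python 3 sketch of it before the full module:

```python
import re

def decode_numeric_entities(raw):
    # Replace each decimal HTML character reference (&#NNNN;)
    # with the character it encodes.
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), raw)

print(decode_numeric_entities('&#23454;&#20064;'))  # → 实习 ("internship")
```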
import sys
import re

reload(sys)
sys.setdefaultencoding('utf-8')

def get_native(raw):
    # Repeatedly find a decimal character reference such as &#23454;
    # and replace it with the character it encodes.
    tostring = raw
    while True:
        obj = re.search(r'&#(\d+);', tostring)
        if obj is None:
            break
        tostring = tostring.replace(obj.group(0), unichr(int(obj.group(1))))
    return tostring
saveInfo.py
Collects each post's fields into a list of dicts and writes them into a MySQL table named info.
# -*- coding: utf-8 -*-
import MySQLdb

class saveSqlite():
    def __init__(self):
        self.infoList = []

    def saveSingle(self, author=None, title=None, date=None, url=None, reply=0, view=0):
        if author is None or title is None or date is None or url is None:
            print "No info saved!"
        else:
            singleDict = {}
            singleDict['author'] = author
            singleDict['title'] = title
            singleDict['date'] = date
            singleDict['url'] = url
            singleDict['reply'] = reply
            singleDict['view'] = view
            self.infoList.append(singleDict)

    def toMySQL(self):
        conn = MySQLdb.connect(host='localhost', user='root', passwd='', port=3306, db='db_name', charset='utf8')
        cursor = conn.cursor()
        # Clear out the previous run's rows before inserting the fresh batch.
        sql = "delete from info"
        cursor.execute(sql)
        conn.commit()
        sql = "insert into info(title,author,url,date,reply,view) values (%s,%s,%s,%s,%s,%s)"
        params = []
        for each in self.infoList:
            params.append((each['title'], each['author'], each['url'], each['date'], each['reply'], each['view']))
        cursor.executemany(sql, params)
        conn.commit()
        cursor.close()
        conn.close()

    def show(self):
        for each in self.infoList:
            print "author: " + each['author']
            print "title: " + each['title']
            print "date: " + each['date']
            print "url: " + each['url']
            print "reply: " + str(each['reply'])
            print "view: " + str(each['view'])
            print '\n'

if __name__ == '__main__':
    save = saveSqlite()
    save.saveSingle('网', 'aaa', '2008-10-10 10:10:10', 'www.baidu.com', 1, 1)
    # save.show()
    save.toMySQL()
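toMySQL() assumes an info table already exists in db_name. The article never shows its schema, so the DDL below is a guess; only the column names are taken from the INSERT statement above, and the types and lengths are assumptions:

```python
# Hypothetical DDL matching the INSERT used by toMySQL(); column types
# and lengths are assumptions, not taken from the original article.
CREATE_INFO_TABLE = """
CREATE TABLE IF NOT EXISTS info (
    title  VARCHAR(255),
    author VARCHAR(64),
    url    VARCHAR(255),
    `date` DATETIME,
    reply  INT,
    `view` INT
) DEFAULT CHARSET=utf8;
"""
```

The `date` and `view` names are backquoted because they collide with MySQL keywords.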
Main crawler code
import requests
from lxml import etree
from cc98 import uni_2_native, URLs, saveInfo

# Forge a request header to match whatever site you are crawling.
headers = {
    'Accept': '',
    'Accept-Encoding': '',
    'Accept-Language': '',
    'Connection': '',
    'Cookie': '',
    'Host': '',
    'Referer': '',
    'Upgrade-Insecure-Requests': '',
    'User-Agent': ''
}

cc98 = 'http://www.cc98.org/'  # base URL for joining relative hrefs (assumed; not shown in the original)
url = 'http://www.cc98.org/list.asp'  # append the board's query string here; it must contain the page parameter
print "get information from cc98..."
urls = URLs.getURLs(url, "page", 50)
savetools = saveInfo.saveSqlite()
for url in urls:
    r = requests.get(url, headers=headers)
    html = uni_2_native.get_native(r.text)
    selector = etree.HTML(html)
    content_tr_list = selector.xpath('//form/table[@class="tableborder1 list-topic-table"]/tbody/tr')
    for each in content_tr_list:
        href = each.xpath('./td[2]/a/@href')
        if len(href) == 0:
            continue
        else:
            # xpath() always returns a list, so these nested fors each walk a
            # one-element list; indexing with [0] did not work for me here.
            for each_href in href:
                link = cc98 + each_href
                title_author_time = each.xpath('./td[2]/a/@title')
                for info in title_author_time:
                    info_split = info.split('\n')
                    title = info_split[0][1:len(info_split[0]) - 1]
                    author = info_split[1][3:]
                    date = info_split[2][3:]
                    hot = each.xpath('./td[4]/text()')
                    for hot_num in hot:
                        reply_view = hot_num.strip().split('/')
                        reply, view = reply_view[0], reply_view[1]
                        savetools.saveSingle(author=author, title=title, date=date, url=link, reply=reply, view=view)
print "All got! Now saving to Database..."
# savetools.show()
savetools.toMySQL()
print "ALL CLEAR! Have Fun!"
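The slicing inside the loop depends on the exact layout of each link's @title attribute: the post title wrapped in brackets on the first line, then lines prefixed with 作者: (author) and 时间: (time). A standalone sketch with a made-up sample value shows how the pieces come apart:

```python
# Sample @title value (made up; real values come from the forum page).
info = '[实习信息]\n作者:alice\n时间:2018-03-01 12:00:00'
parts = info.split('\n')
title = parts[0][1:-1]    # strip the surrounding [ ]
author = parts[1][3:]     # drop the 3-character "作者:" prefix
date = parts[2][3:]       # drop the 3-character "时间:" prefix

# The fourth cell packs reply and view counts into one "reply/view" string.
reply, view = ' 3/120 '.strip().split('/')

print(title, author, date, reply, view)
# → 实习信息 alice 2018-03-01 12:00:00 3 120
```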
That is all for this article. I hope it helps with your study.