Python爬取读者并制作成PDF

更新时间：2015年03月10日 11:45:54 投稿：hebedich

本文是在学习了beautifulsoup之后，制作的一个爬取读者杂志并使用reportlab制作成pdf的python小工具，咱也文艺一下：），分享给大家，有需要的小伙伴参考下吧。

学了下beautifulsoup后,做个个网络爬虫,爬取读者杂志并用reportlab制作成pdf..

crawler.py

#!/usr/bin/env python
#coding=utf-8
"""
    Author:         Anemone
    Filename:       getmain.py
    Last modified: 2015-02-19 16:47
    E-mail:         anemone@82flex.com
"""
import urllib2
from bs4 import BeautifulSoup
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
def getEachArticle(url):
#    response = urllib2.urlopen('http://www.52duzhe.com/2015_01/duzh20150104.html')
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)#.decode("utf-8").encode("gbk"))
    #for i in soup.find_all('div'):
    #    print i,1
    title=soup.find("h1").string
    writer=soup.find(id="pub_date").string.strip()
    _from=soup.find(id="media_name").string.strip()
    text=soup.get_text()#.encode("utf-8")
    main=re.split("BAIDU_CLB.*;",text)
    result={"title":title,"writer":writer,"from":_from,"context":main[1]}
    return result
    #new=open("new.txt","w")
    #new.write(result["title"]+"\n\n")
    #new.write(result["writer"]+" "+result["from"])
    #new.write(result["context"])
    #new.close()
def getCatalog(issue):
    url="http://www.52duzhe.com/"+issue[:4]+"_"+issue[-2:]+"/"
    firstUrl=url+"duzh"+issue+"01.html"
    firstUrl=url+"index.html"
    duzhe=dict()
    response = urllib2.urlopen(firstUrl)
    html = response.read()
    soup=BeautifulSoup(html)
    firstUrl=url+soup.table.a.get("href")
    response = urllib2.urlopen(firstUrl)
    html = response.read()
    soup = BeautifulSoup(html)
    all=soup.find_all("h2")
    for i in all:
        print i.string
        duzhe[i.string]=list()
        for link in i.parent.find_all("a"):
            href=url+link.get("href")
            print href
            while 1:
                try:
                    article=getEachArticle(href)
                    break
                except:
                    continue
            duzhe[i.string].append(article)
    return duzhe
def readDuZhe(duzhe):
    for eachColumn in duzhe:
        for eachArticle in duzhe[eachColumn]:
            print eachArticle["title"]
if __name__ == '__main__':
#    issue=raw_input("issue(201501):")
    readDuZhe(getCatalog("201424"))

getpdf.py

复制代码代码如下:

#!/usr/bin/env python
#coding=utf-8
"""
 Author: Anemone
 Filename: writetopdf.py
 Last modified: 2015-02-20 19:19
 E-mail: anemone@82flex.com
"""
#coding=utf-8
import reportlab.rl_config
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from reportlab.lib import fonts
import copy
from reportlab.platypus import Paragraph, SimpleDocTemplate,flowables
from reportlab.lib.styles import getSampleStyleSheet
import crawler
def writePDF(issue,duzhe):
 reportlab.rl_config.warnOnMissingFontGlyphs = 0
 pdfmetrics.registerFont(TTFont('song',"simsun.ttc"))
 pdfmetrics.registerFont(TTFont('hei',"msyh.ttc"))
 fonts.addMapping('song', 0, 0, 'song')
 fonts.addMapping('song', 0, 1, 'song')
 fonts.addMapping('song', 1, 0, 'hei')
 fonts.addMapping('song', 1, 1, 'hei')
 stylesheet=getSampleStyleSheet()
 normalStyle = copy.deepcopy(stylesheet['Normal'])
 normalStyle.fontName ='song'
 normalStyle.fontSize = 11
 normalStyle.leading = 11
 normalStyle.firstLineIndent = 20
 titleStyle = copy.deepcopy(stylesheet['Normal'])
 titleStyle.fontName ='song'
 titleStyle.fontSize = 15
 titleStyle.leading = 20
 firstTitleStyle = copy.deepcopy(stylesheet['Normal'])
 firstTitleStyle.fontName ='song'
 firstTitleStyle.fontSize = 20
 firstTitleStyle.leading = 20
 firstTitleStyle.firstLineIndent = 50
 smallStyle = copy.deepcopy(stylesheet['Normal'])
 smallStyle.fontName ='song'
 smallStyle.fontSize = 8
 smallStyle.leading = 8
 story = []
 story.append(Paragraph("读者{0}期".format(issue), firstTitleStyle))
 for eachColumn in duzhe:
 story.append(Paragraph('__'*28, titleStyle))
 story.append(Paragraph('{0}'.format(eachColumn), titleStyle))
 for eachArticle in duzhe[eachColumn]:
 story.append(Paragraph(eachArticle["title"],normalStyle))
 story.append(flowables.PageBreak())
 for eachColumn in duzhe:
 for eachArticle in duzhe[eachColumn]:
 story.append(Paragraph("{0}".format(eachArticle["title"]),titleStyle))
 story.append(Paragraph(" {0} {1}".format(eachArticle["writer"],eachArticle["from"]),smallStyle))
 para=eachArticle["context"].split("　　")
 for eachPara in para:
 story.append(Paragraph(eachPara,normalStyle))
 story.append(flowables.PageBreak())
 #story.append(Paragraph("context",normalStyle))
 doc = SimpleDocTemplate("duzhe"+issue+".pdf")
 print "Writing PDF..."
 doc.build(story)
def main(issue):
 duzhe=crawler.getCatalog(issue)
 writePDF(issue,duzhe)
if __name__ == '__main__':
 issue=raw_input("Enter issue(201501):")
 main(issue)

以上就是本文的全部内容了，希望大家能够喜欢。

您可能感兴趣的文章:

一款Python工具制作的动态条形图(强烈推荐!)
有时为了方便看数据的变化情况,需要画一个动态图来看整体的变化情况,下面这篇文章主要给大家介绍了一款Python工具制作的动态条形图的相关资料,文中通过实例代码介绍的非常详细,需要的朋友可以参考下
2023-02-02
使用Django实现商城验证码模块的方法
本文主要涉及图形验证码的相关功能，主要包括，图形验证码获取、验证码文字存储、验证码生成等。需要的朋友们下面随着小编来一起学习学习吧
2021-06-06
pandas如何将dataframe中的NaN替换成None
这篇文章主要介绍了pandas如何将dataframe中的NaN替换成None问题,具有很好的参考价值,希望对大家有所帮助,如有错误或未考虑完全的地方,望不吝赐教
2023-08-08
python使用threading.Condition交替打印两个字符
这篇文章主要为大家详细介绍了python使用threading.Condition交替打印两个字符，具有一定的参考价值，感兴趣的小伙伴们可以参考一下
2019-05-05
Python中logging日志记录到文件及自动分割的操作代码
这篇文章主要介绍了Python中logging日志记录到文件及自动分割,本文给大家介绍的非常详细，对大家的学习或工作具有一定的参考借鉴价值，需要的朋友可以参考下
2020-08-08
Python3爬虫里关于识别微博宫格验证码的知识点详解
在本篇文章里小编给大家分享了关于Python3爬虫里关于识别微博宫格验证码的知识点，有兴趣的朋友们可以参考下。
2020-07-07
对numpy数据写入文件的方法讲解
今天小编就为大家分享一篇对numpy数据写入文件的方法讲解，具有很好的参考价值，希望对大家有所帮助。一起跟随小编过来看看吧
2018-07-07
python破解bilibili滑动验证码登录功能
这篇文章主要介绍了python破解bilibili滑动验证码登录功能，本文给大家介绍的非常详细，具有一定的参考借鉴价值,需要的朋友可以参考下
2019-09-09
python中list列表的高级函数
这篇文章主要为大家详细介绍了python中list列表的高级函数，感兴趣的小伙伴们可以参考一下
2016-05-05
Python lambda表达式filter、map、reduce函数用法解析
这篇文章主要介绍了Python lambda表达式filter、map、reduce函数用法解析,文中通过示例代码介绍的非常详细，对大家的学习或者工作具有一定的参考学习价值,需要的朋友可以参考下
2019-09-09

Python爬取读者并制作成PDF

相关文章

最新评论

大家感兴趣的内容

最近更新的内容

常用在线小工具