Example script: using Python to scrape e-book resources from 脚本之家 (jb51) and download them automatically

Updated: 2018-10-23 15:58:26 Author: 忘忧草论坛

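This article presents an example script that uses Python to scrape e-book resources from jb51 and download them to the local machine automatically. It may be a useful reference for readers who need something similar.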

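jb51 has a fairly complete collection of resources, so I decided to use Python to collect the information and download the files automatically.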

Python has a rich and powerful set of libraries. With urllib, re, and a few others, you can easily build a web scraper.
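
As a rough illustration of the `re` half of that claim, here is a minimal Python 3 sketch (the article's own script targets Python 2). It runs two patterns of the kind the script uses against a hard-coded snippet, so no network access is needed. The sample markup is modeled on the HTML comments in the script, and the book title and the `extract_field` helper are hypothetical names made up for this example.

```python
import re

# Sample markup modeled on the HTML comments in the article's script;
# the title text is a made-up placeholder.
SAMPLE_HTML = (
    '<h1 itemprop="name">Example Python Book</h1>'
    '<li>书籍大小：<span itemprop="fileSize">27.2MB</span></li>'
)

TITLE_RE = re.compile(r'<h1\s+?itemprop="name">(?P<title>.+?)</h1>')
SIZE_RE = re.compile(r'<span\s+?itemprop="fileSize">(?P<filesize>.+?)</span>')


def extract_field(pattern, html, group):
    """Return the named group from the first match, or None if absent."""
    match = pattern.search(html)
    return match.group(group) if match else None


title = extract_field(TITLE_RE, SAMPLE_HTML, "title")
filesize = extract_field(SIZE_RE, SAMPLE_HTML, "filesize")
print(title, filesize)
```

Returning `None` for a missing field lets the caller decide whether to skip the page or log an error, instead of failing on a `None` match object.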

Below is an example script I wrote. It collects every e-book in a particular section of a technical website and saves the files locally.

(Screenshot of the script running.)

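While it runs, the script not only prints information to the shell window but also writes a log to a txt file, recording each page URL it collects, the book's title and size, the download URL on the site's own server, and the Baidu Netdisk (百度网盘) download URL.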
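
The "print to the shell and also write to a log file" pattern can be wrapped in one small helper. The following is a Python 3 sketch, not the author's code: `log` is a hypothetical function name, and the demonstration writes to an in-memory buffer instead of a real file.

```python
import io
import sys


def log(message, logfile, stream=None):
    """Write one line to the console stream and to the log file."""
    stream = stream if stream is not None else sys.stdout
    line = message + "\n"
    stream.write(line)
    logfile.write(line)


# Demonstration with an in-memory file; a real run would pass a file opened
# with open(path, "w", encoding="utf-8") instead (the path is up to you).
demo_log = io.StringIO()
log("Title: Example Python Book", demo_log)
log("File size: 27.2MB", demo_log)
```

Routing every message through one function keeps the console output and the log file in step, which the script otherwise has to do by pairing each `print` with a `write` call.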

Example: scraping and downloading the e-books in jb51's Python section:

# -*- coding:utf-8 -*-
# Python 2 script (uses urllib2 and the print statement).
import re
import urllib2
import urllib
import sys
import os

# Python 2 only: make utf-8 the default codec for implicit str/unicode conversion
reload(sys)
sys.setdefaultencoding('utf-8')


def getHtml(url):
    request = urllib2.Request(url)
    page = urllib2.urlopen(request)
    htmlcontent = page.read()
    # Avoid garbled Chinese: decode the page as GBK and re-encode it as UTF-8
    htmlcontent = htmlcontent.decode('gbk', 'ignore').encode("utf8", 'ignore')
    return htmlcontent


def report(count, blockSize, totalSize):
    # Progress hook for urlretrieve; '\r' redraws the same console line
    percent = int(count*blockSize*100/totalSize)
    sys.stdout.write("\r%d%%" % percent + ' complete')
    sys.stdout.flush()


def getBookInfo(url):
    htmlcontent = getHtml(url)
    regex_title = r'<h1\s+?itemprop="name">(?P<title>.+?)</h1>'
    title = re.search(regex_title, htmlcontent)
    if not title:
        # The title is needed below to build the file name, so skip this page
        print "Title not found:", url
        file_error.write(url + '\n')
        return
    title = title.group("title")
    print "Title:", title
    file_object.write('Title:' + title + '\n')

    # <li>书籍大小：<span itemprop="fileSize">27.2MB</span></li>
    filesize = re.search(r'<span\s+?itemprop="fileSize">(?P<filesize>.+?)</span>', htmlcontent)
    if filesize:
        filesize = filesize.group("filesize")
        print "File size:", filesize
        file_object.write('File size:' + filesize + '\n')

    # <div class="picthumb"><a href="//img.jbzj.com/do/uploads/litimg/151210/1A9262GO2.jpg" target="_blank"
    bookimg = re.search(r'<div\s+?class="picthumb"><a href="(?P<bookimg>.+?)" rel="external nofollow" target="_blank"', htmlcontent)
    if bookimg:
        bookimg = bookimg.group("bookimg")
        print "Cover image:", bookimg
        file_object.write('Cover image:' + bookimg + '\n')

    # <li><a href="http://xz6.jb51.net:81/201512/books/JavaWeb100(jb51.net).rar" target="_blank">酷云中国电信下载</a></li>
    # The Chinese link text (a China Telecom mirror) must stay as-is to match the page
    downurl1 = re.search(r'<li><a href="(?P<downurl1>.+?)" rel="external nofollow" target="_blank">酷云中国电信下载</a></li>', htmlcontent)
    if downurl1:
        downurl1 = downurl1.group("downurl1")
        print "Download URL 1:", downurl1
        file_object.write('Download URL 1:' + downurl1 + '\n')
        sys.stdout.write('\rFetching ' + title + '...\n')
        # Strip spaces and slashes so the title is usable as a file name
        title = title.replace(' ', '')
        title = title.replace('/', '')
        saveFile = '/Users/superl/Desktop/pythonbook/' + title + '.rar'
        if os.path.exists(saveFile):
            print "This file has already been downloaded!"
        else:
            urllib.urlretrieve(downurl1, saveFile, reporthook=report)
            sys.stdout.write("\rDownload complete, saved as %s" % (saveFile) + '\n\n')
            sys.stdout.flush()
            file_object.write('File downloaded successfully!\n')
    else:
        print "Download URL 1 not found"
        file_error.write(url + '\n')
        file_error.write(title + ": download URL 1 not found! The file was not downloaded automatically!\n")
        file_error.write('\n')

    # <li><a href="http://pan.baidu.com/s/1pKfVNwJ" rel="external nofollow" target="_blank">百度网盘下载2</a></li>
    # The Chinese link text (Baidu Netdisk download 2) must stay as-is to match the page
    downurl2 = re.search(r'</a></li><li><a href="(?P<downurl2>.+?)" rel="external nofollow" target="_blank">百度网盘下载2</a></li>', htmlcontent)
    if downurl2:
        downurl2 = downurl2.group("downurl2")
        print "Download URL 2:", downurl2
        file_object.write('Download URL 2:' + downurl2 + '\n')
    else:
        print "Download URL 2 not found"
        file_error.write(title + ": download URL 2 not found\n")
        file_error.write('\n')
    file_object.write('\n')
    print "\n"


def getBooksUrl(url):
    htmlcontent = getHtml(url)
    # <ul class="cur-cat-list"><a href="/books/438381.html" rel="external nofollow" class="tit"</ul></div><!--end #content -->
    urls = re.findall(r'<a href="(?P<urls>.+?)" rel="external nofollow" class="tit"', htmlcontent)
    for url in urls:
        # urlopen needs a full URL including the scheme (http is assumed here)
        url = "http://www.jb51.net" + url
        print url + "\n"
        file_object.write(url + '\n')
        getBookInfo(url)


if __name__ == "__main__":
    # Main log and error log; getBookInfo and getBooksUrl use them as globals
    file_object = open('/Users/superl/Desktop/python.txt', 'w+')
    file_error = open('/Users/superl/Desktop/pythonerror.txt', 'w+')
    pagenum = 3  # number of list pages to crawl
    for pagevalue in range(1, pagenum+1):
        listurl = "http://www.jb51.net/books/list476_%d.html" % pagevalue
        print listurl
        file_object.write(listurl + '\n')
        getBooksUrl(listurl)
    file_object.close()
    file_error.close()
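
The `report` hook in the script divides by `totalSize` directly. Two edge cases are worth guarding against: the total size can be reported as -1 when the server sends no length, and `count * blockSize` usually overshoots the total on the last block. The sketch below is a Python 3 variant under those assumptions. `percent_done` and `safe_name` are hypothetical helper names, not part of the original script.

```python
import sys


def percent_done(count, block_size, total_size):
    """Return download progress as an int 0..100, or None if the size is unknown."""
    if total_size <= 0:
        return None
    return min(100, count * block_size * 100 // total_size)


def report(count, block_size, total_size):
    """Progress hook with the (count, block_size, total_size) arguments
    that urllib.request.urlretrieve passes to its reporthook."""
    percent = percent_done(count, block_size, total_size)
    if percent is not None:
        sys.stdout.write("\r%d%% complete" % percent)
        sys.stdout.flush()


def safe_name(title):
    """Drop spaces and slashes, as the script does before building the file path."""
    return title.replace(" ", "").replace("/", "")
```

In Python 3 the hook would be passed as `urllib.request.urlretrieve(url, path, reporthook=report)`, which is the counterpart of the `urllib.urlretrieve` call used above.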

Note: I have altered some of the URLs in the code above.
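
Several links in the code and its HTML comments are relative (`/books/438381.html`) or protocol-relative (`//img.jbzj.com/...`). One way to turn them into full URLs is `urljoin` from the standard library, shown here as a Python 3 sketch. The `http` scheme on the base URL is an assumption, chosen to match the `http://` download links quoted in the script's comments.

```python
from urllib.parse import urljoin

# Base page the links were found on; the http scheme is an assumption.
LIST_URL = "http://www.jb51.net/books/list476_1.html"

# A path-only link resolves against the base host.
book_url = urljoin(LIST_URL, "/books/438381.html")
# A protocol-relative link keeps its own host and inherits the base scheme.
cover_url = urljoin(LIST_URL, "//img.jbzj.com/do/uploads/litimg/151210/1A9262GO2.jpg")

print(book_url)
print(cover_url)
```

Compared with string concatenation, this handles both link styles with the same call and does not depend on knowing the host name in advance.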

Summary

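That is the example script for scraping jb51 e-book resources with Python and downloading them to the local machine. I hope it is helpful. If you have any questions, please leave me a comment and I will reply promptly. Thank you also for your support of the 脚本之家 website!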
