Parsing HTML in Python to extract specific element information (Part 2): using the BeautifulSoup library


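Beautiful Soup is a third-party Python library for parsing HTML and XML. It makes it easy to extract data from web pages.

The current major version is Beautiful Soup 4, which is distributed in the bs4 package, so you import it with from bs4 import BeautifulSoup.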

Install command:

pip install beautifulsoup4

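To check whether it is installed:

(1) Run pip list on the command line and check whether beautifulsoup4 appears in the output.

(2) In the Python interpreter, enter from bs4 import BeautifulSoup. If no error is raised, the installation succeeded.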
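
The import check in step (2) can also be done from a script without triggering an ImportError. The following is a minimal sketch using the standard library; the helper name is_installed is hypothetical and not part of the original article.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# bs4 is the import name of the beautifulsoup4 package; lxml is the optional parser.
for name in ("bs4", "lxml"):
    print(name, "installed" if is_installed(name) else "missing")
```

If bs4 is reported as missing, run the pip install command shown above.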

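BeautifulSoup supports the HTML parser in Python's standard library (html.parser) as well as third-party parsers such as lxml. The lxml library must be installed separately (pip install lxml). The lxml parser is generally recommended because it is fast and tolerant of malformed documents.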
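
Because lxml is an optional dependency, a script can prefer it and fall back to the built-in parser when it is absent. This is only a sketch: the helper name make_soup is hypothetical, and it relies on bs4 raising FeatureNotFound when the requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if it is installed, otherwise with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser from the standard library
        return BeautifulSoup(markup, "html.parser")


# Deliberately unclosed tags: both parsers still build a usable tree
soup = make_soup("<p>unclosed<b>text")
print(soup.find("b").get_text())
```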

Using BeautifulSoup to parse HTML:

(1) Import the BeautifulSoup library, then instantiate a BeautifulSoup object

soup=BeautifulSoup(html_doc,"lxml")

(2) You can then use find() and find_all() to get specific elements

soup.find('title').string #get the text inside the title tag

soup.find('title').attrs #get the title tag's attributes as a dict

soup.find_all('a') #get all a tag objects as a list
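
To make the return values of these calls concrete, here is a small self-contained sketch. The html_doc string is made up for illustration, and html.parser is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, not taken from the article
html_doc = (
    '<html><head><title id="t">Demo</title></head>'
    '<body><a href="/a">A</a><a href="/b">B</a></body></html>'
)
soup = BeautifulSoup(html_doc, "html.parser")

title = soup.find("title")        # first matching tag
print(title.string)               # text of the tag
print(title.attrs)                # attributes as a dict

links = [a.get("href") for a in soup.find_all("a")]  # find_all returns every match
print(links)

missing = soup.find("h1")         # find() returns None when nothing matches
print(missing)
```

Since find() returns None for a missing element, check the result before reading .string or .attrs on pages whose structure you do not control.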

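Note: if a tag contains child tags, the string attribute may not give the text you expect (it returns None when the tag has more than one child). Use the get_text() method instead, which also collects the text of child tags.

The following example code is provided for reference: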
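
The difference is easy to see on a tag that mixes text with child tags. The markup below is a made-up sample, and html.parser is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Made-up sample: an h6 that mixes plain text with <small> children
html_doc = "<h6>Category: python <small>|</small> Views: 1086</h6>"
soup = BeautifulSoup(html_doc, "html.parser")

h6 = soup.find("h6")
print(h6.string)                   # None, because h6 has several children
print(h6.get_text())               # all text, including text inside <small>
print(soup.find("small").string)   # a tag with a single text child works with .string
```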

from bs4 import BeautifulSoup
#The page source to parse; it could also be fetched over the network or read from a file
html_pagesource='''
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
        <title>php技术分享博客|个人php技术博客</title>
    </head>
        
    <body class="multi index">
        <div id="divAll">
            <div id="divNavBar">
                <ul>
                    <li id="nvabar-item-index"><a href="/">首页</a></li>
                    <li id="navbar-category-2"><a href="/list/2.html">php基础知识</a></li>
                    <li id="navbar-category-3"><a href="/list/3.html">php安全知识</a></li>
                    <li id="navbar-category-4"><a href="/list/4.html">php疑难问题</a></li>
                    <li id="navbar-category-7"><a href="/list/7.html">php进阶开发</a></li>
                    <li id="navbar-category-6"><a href="/list/6.html">python学习</a></li>
                    <li id="navbar-page-2"><a href="/aboutus">关于我</a></li>
                </ul>
            </div>
                                
            <div id="divMain">
                <div class="post multi">
                    <h2 class="post-title"><a href="/article/29.html">使用selenium获取网址所加载所有资源url列表信息</a><span class="post-date">2021-01-07 01:01</span></h2>
                    
                    <h6 class="post-footer"> 分类:python <small>|</small> 浏览:1086 <small>|</small> 评论:2   </h6>
                </div>
                
                <div class="post multi">
                        <h2 class="post-title"><a href="/article/28.html">mysql中索引类型Btree和Hash的区别以及使用场景</a><span class="post-date">2020-11-16 23:50</span></h2>
                        <h6 class="post-footer"> 分类:mysql学习 <small>|</small> 浏览:544 <small>|</small> 评论:3   </h6>
                </div>
                
                <div class="post multi">
                    <h2 class="post-title"><a href="/article/27.html">selenium在Centos服务器下环境搭建</a><span class="post-date">2020-11-04 19:21</span></h2>
                    <h6 class="post-footer"> 分类:python <small>|</small> 浏览:729 <small>|</small> 评论:0    </h6>
                </div>
            </div>
        </div>
    </body>
</html>
'''

soup=BeautifulSoup(html_pagesource,"lxml")
title=soup.find("title").string
print(title) #print the title text

#Get the category list: each section's name and link
cates=soup.find('div',attrs={'id':'divNavBar'}).find_all('a')
for cate in cates:
    print(cate.string,cate.attrs['href'])
#Get the article list
infos=[]
articles=soup.find_all('div',attrs={"class":"post"})
for article in articles:
    info={}
    title_elm=article.find('a')
    info['title']=title_elm.string
    info['href']=title_elm.get('href')
    info['dateline']=article.find(attrs={"class":"post-date"}).string
    
    #.string returns None here because the h6 contains <small> child tags,
    #so use get_text() to collect all of the footer text
    other=article.find(attrs={"class":"post-footer"}).get_text()
    others=other.split('|')
    for key,i in enumerate(others):
        if key==0:
            info['cate']=i.replace('分类:','').strip()
        elif key==1:
            info['views']=i.replace('浏览:','').strip()
        elif key==2:
            info['replies']=i.replace('评论:','').strip()
    print(info)
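
The footer parsing in the loop above depends on the position of each field. As a possible alternative, the sketch below splits each "label:value" pair and keys the result by label, so a reordered footer still parses. The helper name parse_footer is hypothetical, and the sample string mirrors the get_text() output for the first post.

```python
def parse_footer(text):
    """Split 'label:value | label:value' text into a dict keyed by label."""
    fields = {}
    for part in text.split("|"):
        label, sep, value = part.partition(":")
        if sep:  # skip fragments that contain no colon
            fields[label.strip()] = value.strip()
    return fields


# Same shape as the post-footer text of the first article in the sample page
footer = " 分类:python | 浏览:1086 | 评论:2   "
print(parse_footer(footer))
```

The values are still strings; convert the counts with int() if you need numbers.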


Category: Python development
Tags: none
Views: 2780
Published: 2021-08-28 14:19:00
Link: https://www.pyii.cn/article/34.html
