
An Introduction to Python's BeautifulSoup


KaiWn


  • 1. Introduction to BeautifulSoup

    BeautifulSoup is a Python library for extracting data from HTML and XML files; working through a parser, it gives you idiomatic ways to navigate, search, and modify the document.

    BeautifulSoup is not built on regular expressions; it sits on top of an underlying parser (such as the standard library's html.parser or the third-party lxml) and exposes powerful parsing features on top of it. Using BeautifulSoup makes both data extraction and crawler development noticeably faster.
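
    As a minimal sketch of that workflow (the snippet below is illustrative and not part of the original post), you hand BeautifulSoup a string of markup plus a parser name and get back a searchable object:

    from bs4 import BeautifulSoup

    # Parse a tiny snippet with the built-in html.parser and pull out the text.
    soup = BeautifulSoup("<p class='demo'>Hello, soup!</p>", "html.parser")
    print(soup.p.get_text())  # -> Hello, soup!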

    2. BeautifulSoup Overview

    Building the document tree

    BeautifulSoup parses a document by building a document tree, and that tree is assembled from four kinds of data objects.

    Tree object       Description
    Tag               A tag; accessed as soup.tag; attributes: tag.name (the tag name) and tag.attrs (the tag's attributes)
    NavigableString   The traversable string inside a tag; accessed as soup.tag.string
    BeautifulSoup     The entire document, which can be treated like a Tag; attributes: soup.name and soup.attrs
    Comment           A comment inside a tag's string; also accessed as soup.tag.string
    # lxml only needs to be installed for the 'lxml' parser below; it is not imported directly.
    from bs4 import BeautifulSoup
    
    html =  """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # 1. The BeautifulSoup object (represents the whole document)
    soup = BeautifulSoup(html, 'lxml')
    print(type(soup))
    
    # 2. The Tag object
    print(soup.head,'\n')
    print(soup.head.name,'\n')
    print(soup.head.attrs,'\n')
    print(type(soup.head))
    
    # 3. The NavigableString object
    print(soup.title.string,'\n')
    print(type(soup.title.string))
    
    # 4. The Comment object (the first <a> contains only an HTML comment)
    print(soup.a.string,'\n')
    print(type(soup.a.string))
    
    # 5. Pretty-print the parsed document
    print(soup.prettify())
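
    One detail that is easy to miss in the output above: because the first <a> holds only an HTML comment, soup.a.string comes back as a Comment object that prints like plain text. A small check (using the same soup as above) shows how to tell the two apart:

    from bs4 import Comment

    # isinstance() lets you skip comments when you only want real text nodes.
    print(isinstance(soup.a.string, Comment))      # -> True
    print(isinstance(soup.title.string, Comment))  # -> False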

    Traversing the document tree

    BeautifulSoup turns the document into a tree precisely because a tree structure makes it easier to traverse the content and extract what you need.

    Downward traversal        Description
    tag.contents              The tag's child nodes, as a list
    tag.children              The tag's child nodes, as an iterator for looping
    tag.descendants           All of the tag's descendants, as an iterator for looping

    Upward traversal          Description
    tag.parent                The tag's parent node
    tag.parents               The tag's ancestor nodes, as an iterator for looping

    Sibling traversal         Description
    tag.next_sibling          The tag's next sibling node
    tag.previous_sibling      The tag's previous sibling node
    tag.next_siblings         All following siblings, as an iterator for looping
    tag.previous_siblings     All preceding siblings, as an iterator for looping
    from bs4 import BeautifulSoup
    
    html =  """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    soup = BeautifulSoup(html,'html.parser')
    
    # 1. Downward traversal
    print(soup.p.contents)
    print(list(soup.p.children))
    print(list(soup.p.descendants))
    
    # 2. Upward traversal
    print(soup.p.parent.name,'\n')
    for i in soup.p.parents:
        print(i.name)
    
    # 3. Sibling traversal
    print('a_next:',soup.a.next_sibling)
    for i in soup.a.next_siblings:
        print('a_nexts:',i)
    print('a_previous:',soup.a.previous_sibling)
    for i in soup.a.previous_siblings:
        print('a_previouss:',i)
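
    One gotcha worth noting: with most parsers the whitespace and punctuation between tags are themselves string nodes, so next_sibling often returns a text node rather than the neighbouring tag. A quick check against the same soup as above:

    # The sibling of the first <a> is the text node ",\n", not the <a id="link2"> tag.
    print(repr(soup.a.next_sibling))

    # find_next_sibling('a') skips the text nodes and returns the next <a> tag instead.
    print(soup.a.find_next_sibling('a'))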

    Searching the document tree

    BeautifulSoup provides a number of search methods that make it easy to pull out exactly the content we need.

    Search method                        Description
    soup.find_all()                      Finds every tag matching the criteria; returns a list-like ResultSet
    soup.find()                          Finds the first matching tag; returns a Tag object (or None if nothing matches)
    soup.tag.find_parents()              Searches all of the tag's ancestors; returns a list-like ResultSet
    soup.tag.find_parent()               Searches the tag's parent; returns a Tag object
    soup.tag.find_next_siblings()        Searches all following siblings; returns a list-like ResultSet
    soup.tag.find_next_sibling()         Searches the next sibling; returns a Tag object
    soup.tag.find_previous_siblings()    Searches all preceding siblings; returns a list-like ResultSet
    soup.tag.find_previous_sibling()     Searches the previous sibling; returns a Tag object

    Note that because class is a reserved keyword in Python, matching a tag's class attribute requires one of two special approaches (a short side-by-side sketch follows the example below):

    • pass the attribute as a dictionary through the attrs parameter
    • use BeautifulSoup's dedicated class_ keyword
    from bs4 import BeautifulSoup
    
    html =  """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    soup = BeautifulSoup(html,'html.parser')
    
    # 1. find_all()
    print(soup.find_all('a'))  # search by tag name
    print(soup.find_all('a', id='link1'))  # search by attribute value
    print(soup.find_all('a',class_='sister')) 
    print(soup.find_all(text=['Elsie','Lacie']))
    
    # 2. find()
    print(soup.find('a'))
    print(soup.find(id='link2'))
    
    # 3. Searching upward
    print(soup.p.find_parent().name)
    for i in soup.title.find_parents():
        print(i.name)
        
    # 4. Searching siblings
    print(soup.head.find_next_sibling().name)
    for i in soup.head.find_next_siblings():
        print(i.name)
    print(soup.title.find_previous_sibling())
    for i in soup.title.find_previous_siblings():
        print(i.name)
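
    To make the two class-matching approaches from the note above concrete, here is a minimal side-by-side sketch (using the same soup object as in the example):

    # Both calls match the three <a class="sister"> tags; the attrs dictionary avoids
    # the clash with Python's keyword, while class_ is BeautifulSoup's shorthand for it.
    print(soup.find_all('a', attrs={'class': 'sister'}))
    print(soup.find_all('a', class_='sister'))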

    CSS selectors

    BeautifulSoup supports the vast majority of CSS selectors: pass a selector string to the .select() method of a Tag or BeautifulSoup object and it returns the matching Tags.

    Common HTML tags:

    HTML heading: <h1> </h1> (through <h6>)
    HTML paragraph: <p> </p>
    HTML link: <a href='https://www.baidu.com/'> this is a link </a>
    HTML image: <img src='Ai-code.jpg' width='104' height='144' />
    HTML table: <table> </table>
    HTML list: <ul> </ul>
    HTML block: <div> </div>
    from bs4 import BeautifulSoup
    
    html =  """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    soup = BeautifulSoup(html,'html.parser')
    
    print('By tag:', soup.select('a'))
    print('By attribute:', soup.select('a[id="link1"]'))
    print('By class:', soup.select('.sister'))
    print('By id:', soup.select('#link1'))
    print('Combined:', soup.select('p #link1'))
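
    As a small follow-up (illustrative, using the same soup as above): select_one() returns only the first match instead of a list, and the usual Tag accessors work on whatever the selector finds:

    # select_one() returns the first matching Tag (or None) rather than a list.
    link = soup.select_one('p.story a#link2')
    print(link.get('href'))  # -> http://example.com/lacie
    print(link.get_text())   # -> Lacie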

    Example: scraping images

    import os

    import requests
    from bs4 import BeautifulSoup

    def getUrl(url):
        """Fetch a page and return its decoded HTML, or None if the request fails."""
        try:
            read = requests.get(url, timeout=10)
            read.raise_for_status()
            read.encoding = read.apparent_encoding
            return read.text
        except requests.RequestException:
            print("Connection failed!")
            return None

    def getPic(html):
        soup = BeautifulSoup(html, "html.parser")

        # Collect every <img> inside the first <ul> on the page.
        all_img = soup.find('ul').find_all('img')
        for img in all_img:
            img_url = img['src']
            print(img_url)
            root = "F:/Pic/"
            # Name the local file after the last path segment of the image URL.
            path = root + img_url.split('/')[-1]
            print(path)
            try:
                if not os.path.exists(root):
                    os.mkdir(root)
                if not os.path.exists(path):
                    read = requests.get(img_url, timeout=10)
                    with open(path, "wb") as f:
                        f.write(read.content)
                    print("File saved!")
                else:
                    print("File already exists!")
            except Exception:
                print("Failed to download the image!")

    if __name__ == '__main__':
        html_text = getUrl("https://findicons.com/search/nature")
        if html_text:
            getPic(html_text)
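
    The scraper assumes the page keeps its thumbnails inside a <ul>; on a page with a different layout, soup.find('ul') returns None and the .find_all() call raises. A slightly more defensive version of that one lookup (illustrative, not from the original post) would be:

    # Inside getPic(), after soup = BeautifulSoup(html, "html.parser"):
    # guard against pages that have no <ul> at all before iterating over images.
    gallery = soup.find('ul')
    all_img = gallery.find_all('img') if gallery else []
    if not all_img:
        print("No images found on this page.")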