Introduction to Python BeautifulSoup

KaiWn

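  • 1. Introduction to BeautifulSoup

    BeautifulSoup is a Python library for extracting data from HTML or XML files. Working together with a parser of your choice, it provides idiomatic ways to navigate, search, and modify the document.

    BeautifulSoup does not parse markup with regular expressions (re); it sits on top of an underlying parser such as Python's built-in html.parser, lxml, or html5lib, and adds powerful navigation and search features. Using it can make both data extraction and crawler development more efficient.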
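
    As a minimal sketch of the workflow described above, the following parses a small snippet and pulls out the title and the link targets. The markup and the variable names are made up for illustration; "html.parser" is chosen only because it ships with Python and needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only for illustration.
snippet = (
    '<html><head><title>Demo</title></head>'
    '<body><a href="/a">A</a><a href="/b">B</a></body></html>'
)

# The second argument names the parser; "html.parser" is built into Python.
soup = BeautifulSoup(snippet, "html.parser")

title = soup.title.string                         # text of the <title> tag
links = [a["href"] for a in soup.find_all("a")]   # href of every <a> tag
print(title, links)
```

    Swapping "html.parser" for "lxml" should work the same way here, provided the lxml package is installed.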

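    2. BeautifulSoup overview

    Building the document tree

    BeautifulSoup parses a document into a tree structure, and that tree is built from four kinds of objects provided by BeautifulSoup.

    Document tree object    Description
    Tag                     A tag; access: soup.tag; attributes: tag.name (tag name), tag.attrs (tag attributes)
    NavigableString         A string inside a tag; access: soup.tag.string
    BeautifulSoup           The whole document, which can be treated like a Tag object; attributes: soup.name (tag name), soup.attrs (tag attributes)
    Comment                 A comment inside a tag, a special kind of NavigableString; access: soup.tag.string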
    # The 'lxml' parser used below requires the lxml package to be installed,
    # but it does not need to be imported here.
    from bs4 import BeautifulSoup
    
    html =  """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    # 1. BeautifulSoup object
    soup = BeautifulSoup(html, 'lxml')
    print(type(soup))

    # 2. Tag object
    print(soup.head, '\n')
    print(soup.head.name, '\n')
    print(soup.head.attrs, '\n')
    print(type(soup.head))

    # 3. NavigableString object
    print(soup.title.string, '\n')
    print(type(soup.title.string))

    # 4. Comment object (the first <a> tag contains only a comment)
    print(soup.a.string, '\n')
    print(type(soup.a.string))

    # 5. Pretty-print the soup object
    print(soup.prettify())
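
    Because a Comment is reached through the same .string attribute as ordinary text, it is easy to mistake one for the other. The sketch below shows one way to tell them apart with isinstance; the helper name visible_text is hypothetical and the markup is a cut-down version of the example above.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<p><a id="link1"><!--Elsie--></a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")


def visible_text(tag):
    """Return tag.string as a plain str, or None if it is missing or a comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


first = visible_text(soup.find(id="link1"))    # the tag holds only a comment
second = visible_text(soup.find(id="link2"))   # the tag holds ordinary text
print(first, second)
```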

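    Traversing the document tree

    BeautifulSoup converts the document into a tree because a tree structure makes it easier to traverse the content and extract what you need.

    Downward traversal       Description
    tag.contents             The tag's direct children, as a list
    tag.children             The tag's direct children, as an iterator for looping
    tag.descendants          All of the tag's descendants, as an iterator for looping
    Upward traversal         Description
    tag.parent               The tag's parent node
    tag.parents              All of the tag's ancestors, as an iterator for looping
    Sideways traversal       Description
    tag.next_sibling         The tag's next sibling node
    tag.previous_sibling     The tag's previous sibling node
    tag.next_siblings        All siblings that follow the tag
    tag.previous_siblings    All siblings that precede the tag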
    from bs4 import BeautifulSoup
    
    html =  """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    soup = BeautifulSoup(html,'html.parser')
    
    # 1. Downward traversal
    print(soup.p.contents)
    print(list(soup.p.children))
    print(list(soup.p.descendants))

    # 2. Upward traversal
    print(soup.p.parent.name, '\n')
    for i in soup.p.parents:
        print(i.name)

    # 3. Sideways traversal
    print('a_next:', soup.a.next_sibling)
    for i in soup.a.next_siblings:
        print('a_nexts:', i)
    print('a_previous:', soup.a.previous_sibling)
    for i in soup.a.previous_siblings:
        print('a_previouss:', i)
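
    One thing the sibling attributes make visible is that text between tags (commas, newlines) counts as a node too, so next_sibling is often a string rather than the next tag. A small sketch of filtering siblings down to Tag objects, using a trimmed copy of the example markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<p class="story">names:
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a> and
<a id="link3">Tillie</a>;
</p>"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a
# The immediate next sibling is the text node after </a>, not the next <a> tag.
raw_next = first_link.next_sibling
# Keep only Tag objects to skip the text nodes between the links.
tag_siblings = [s for s in first_link.next_siblings if isinstance(s, Tag)]
sibling_ids = [t["id"] for t in tag_siblings]
print(repr(raw_next), sibling_ids)
```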

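    Searching the document tree

    BeautifulSoup provides many search methods that make it convenient to get exactly the content we need.

    Search method                        Description
    soup.find_all()                      Finds every tag that matches; returns a list-like result
    soup.find()                          Finds the first tag that matches; returns a Tag object, or None if nothing matches
    soup.tag.find_parents()              Searches all ancestors of the tag; returns a list-like result
    soup.tag.find_parent()               Searches for the tag's parent; returns a Tag object or None
    soup.tag.find_next_siblings()        Searches all siblings after the tag; returns a list-like result
    soup.tag.find_next_sibling()         Searches for the next sibling of the tag; returns a Tag object or None
    soup.tag.find_previous_siblings()    Searches all siblings before the tag; returns a list-like result
    soup.tag.find_previous_sibling()     Searches for the previous sibling of the tag; returns a Tag object or None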
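
    The difference in return types matters in practice: the singular methods can return None, so it is worth checking before using the result. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = '<div><a id="link1">Elsie</a><a id="link2">Lacie</a></div>'
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")          # list-like result, possibly empty
first = soup.find("a")              # a Tag object, not a plain string
missing = soup.find("table")        # None, because there is no <table>
no_tables = soup.find_all("table")  # an empty list-like result

# Guard against None before touching attributes of a find() result.
missing_id = missing.get("id") if missing is not None else None
print(len(links), first.get_text(), missing, no_tables)
```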

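    Note that class is a reserved keyword in Python, so matching a tag's class attribute needs special handling. There are two ways to do it:

    • Pass a dictionary through the attrs parameter
    • Use BeautifulSoup's special keyword argument class_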
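
    The two approaches can be compared side by side. This is only a sketch on made-up markup; both calls are expected to select the same tag.

```python
from bs4 import BeautifulSoup

html = (
    '<p><a class="sister" id="link1">Elsie</a>'
    '<a class="brother" id="link2">Tom</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Option 1: pass the class through the attrs dictionary.
by_attrs = soup.find_all("a", attrs={"class": "sister"})
# Option 2: use the class_ keyword argument.
by_keyword = soup.find_all("a", class_="sister")

ids_attrs = [a["id"] for a in by_attrs]
ids_keyword = [a["id"] for a in by_keyword]
print(ids_attrs, ids_keyword)
```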
    from bs4 import BeautifulSoup
    
    html =  """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    soup = BeautifulSoup(html,'html.parser')
    
    # 1. find_all()
    print(soup.find_all('a'))                    # search by tag name
    print(soup.find_all('a', id='link1'))        # search by attribute value
    print(soup.find_all('a', class_='sister'))   # search by class
    # string= replaces the older text= argument (bs4 4.4.0 and later)
    print(soup.find_all(string=['Elsie', 'Lacie']))

    # 2. find()
    print(soup.find('a'))
    print(soup.find(id='link2'))

    # 3. Searching upward
    print(soup.p.find_parent().name)
    for i in soup.title.find_parents():
        print(i.name)

    # 4. Searching sideways
    print(soup.head.find_next_sibling().name)
    for i in soup.head.find_next_siblings():
        print(i.name)
    print(soup.title.find_previous_sibling())    # None: <title> has no previous sibling
    for i in soup.title.find_previous_siblings():
        print(i.name)

    CSS selectors

    BeautifulSoup supports most CSS selectors. Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find the matching tags.

    Common HTML tags:

    HTML headings: <h1> </h1> through <h6> </h6>
    HTML paragraph: <p> </p>
    HTML link: <a href='https://www.baidu.com/'> this is a link </a>
    HTML image: <img src='Ai-code.jpg' width='104' height='144' />
    HTML table: <table> </table>
    HTML list: <ul> </ul>
    HTML block: <div> </div>
    from bs4 import BeautifulSoup
    
    html =  """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"><!--Elsie--></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """
    
    soup = BeautifulSoup(html,'html.parser')
    
    print('By tag:', soup.select('a'))
    print('By attribute:', soup.select('a[id="link1"]'))
    print('By class:', soup.select('.sister'))
    print('By id:', soup.select('#link1'))
    print('Combined:', soup.select('p #link1'))
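
    Selectors usually serve to pull out attribute values rather than whole tags. The sketch below combines a child combinator with class selectors and also shows select_one(), which returns a single tag or None instead of a list. The markup is a trimmed copy of the example, and the class name brother is made up to show the no-match case.

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list; "p.story > a.sister" matches <a class="sister">
# tags that are direct children of <p class="story">.
hrefs = [a["href"] for a in soup.select("p.story > a.sister")]

# select_one() returns the first match, or None when nothing matches.
first = soup.select_one("a.sister")
nothing = soup.select_one("a.brother")
print(hrefs, first["id"], nothing)
```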

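    Example: downloading images

    import os

    import requests
    from bs4 import BeautifulSoup


    def getUrl(url):
        """Fetch a page and return its text, or None if the request fails."""
        try:
            read = requests.get(url, timeout=10)
            read.raise_for_status()                    # raise on HTTP 4xx/5xx
            read.encoding = read.apparent_encoding     # guess the encoding from the body
            return read.text
        except requests.RequestException:
            print("Connection failed!")
            return None


    def getPic(html, root="F:/Pic/"):
        """Download every image found inside the first <ul> of the page."""
        soup = BeautifulSoup(html, "html.parser")
        ul = soup.find('ul')
        if ul is None:                                 # find() returns None when nothing matches
            print("No <ul> found on the page!")
            return
        os.makedirs(root, exist_ok=True)               # create the target folder if needed
        for img in ul.find_all('img', src=True):       # skip <img> tags without a src
            img_url = img['src']
            path = os.path.join(root, img_url.split('/')[-1])
            print(img_url)
            print(path)
            if os.path.exists(path):
                print("File already exists!")
                continue
            try:
                read = requests.get(img_url, timeout=10)
                read.raise_for_status()
                with open(path, "wb") as f:            # the with block closes the file
                    f.write(read.content)
                print("File saved!")
            except (requests.RequestException, OSError):
                print("Failed to download the file!")


    if __name__ == '__main__':
        html = getUrl("https://findicons.com/search/nature")
        if html is not None:
            getPic(html)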
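
    One detail the example leaves open is that src values are often relative, in which case requests cannot fetch them directly. A possible way to handle this, sketched here with a hypothetical page URL and made-up markup, is to resolve each src against the page address with urllib.parse.urljoin before downloading:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup, for illustration only.
page_url = "https://example.com/gallery/index.html"
html = (
    '<ul>'
    '<li><img src="/img/a.png"></li>'
    '<li><img src="b.png"></li>'
    '<li><img src="https://cdn.example.com/c.png"></li>'
    '</ul>'
)

soup = BeautifulSoup(html, "html.parser")
# urljoin leaves absolute URLs untouched and resolves relative ones
# against the address of the page they came from.
img_urls = [urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)]
print(img_urls)
```

    The resulting absolute URLs could then be passed to requests.get() in place of the raw src values.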