
Python Stage 2 - Introduction to Web Scraping

🎯 Today's goals

  • Understand how website pagination works and how to locate the "Next" link
  • Write loop logic that follows pagination and scrapes every page automatically
  • Integrate multi-page scraping into the crawler

📘 Learning content

  1. 🔁 How website pagination works
    Using quotes.toscrape.com as an example:
  • Home page URL: https://quotes.toscrape.com/
  • Next-page link: <li class="next"><a href="/page/2/">Next</a></li>

With BeautifulSoup we can select li.next > a, read its href attribute to get the next page's path, and join it with the current URL.
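As a quick illustration of that selector, here is a standalone sketch; it parses the pagination snippet shown above with Python's built-in html.parser (instead of lxml, so no extra parser dependency is needed beyond bs4):

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# The pagination snippet as it appears on quotes.toscrape.com
html = '<li class="next"><a href="/page/2/">Next</a></li>'
soup = BeautifulSoup(html, "html.parser")

# CSS selector: an <a> that is a direct child of li.next
next_link = soup.select_one("li.next > a")
if next_link:
    next_url = urljoin("https://quotes.toscrape.com/", next_link["href"])
    print(next_url)  # https://quotes.toscrape.com/page/2/
```

If select_one finds no match it returns None, which is exactly the condition the crawler uses to stop looping.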

  2. 🧪 Core loop (pseudocode)

    while True:
        1. Request the current page URL
        2. Parse the HTML and extract the desired content
        3. Check whether a next-page link exists
           - If it does, build the new URL and continue the loop
           - If not, break out of the loop

💻 Example code (multi-page scraping)

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_all_quotes(start_url):
    quotes = []
    url = start_url
    while url:
        print(f"Scraping: {url}")
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'lxml')
        for quote_block in soup.find_all("div", class_="quote"):
            quote_text = quote_block.find("span", class_="text").text.strip()
            author = quote_block.find("small", class_="author").text.strip()
            tags = [tag.text for tag in quote_block.find_all("a", class_="tag")]
            quotes.append({
                "quote": quote_text,
                "author": author,
                "tags": tags,
            })
        # Look for the next-page link
        next_link = soup.select_one("li.next > a")
        if next_link:
            next_href = next_link['href']
            url = urljoin(url, next_href)  # resolve to an absolute URL
        else:
            url = None
    return quotes

if __name__ == "__main__":
    all_quotes = scrape_all_quotes("https://quotes.toscrape.com/")
    print(f"Scraped {len(all_quotes)} quotes in total")
    # Print the first 3 as a sample
    for quote in all_quotes[:3]:
        print(f"\n{quote['quote']}\n-- {quote['author']} | tags: {', '.join(quote['tags'])}")

🧠 Today's exercises

  • Modify the existing crawler so it scrapes the quotes from all pages

  • Use len() to check how many quotes were collected

  • Extra challenge: save all the data to a JSON file (using json.dump)

    Practice code:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin
    import json

    def scrape_all_quotes(start_url):
        quotes = []
        url = start_url
        while url:
            print(f"Scraping page: {url}")
            response = requests.get(url)
            soup = BeautifulSoup(response.text, "lxml")
            quote_blocks = soup.find_all("div", class_="quote")
            for block in quote_blocks:
                text = block.find("span", class_="text").text.strip()
                author = block.find("small", class_="author").text.strip()
                tags = [tag.text for tag in block.find_all("a", class_="tag")]
                quotes.append({"quote": text, "author": author, "tags": tags})
            # Find the next-page link
            next_link = soup.select_one("li.next > a")
            if next_link:
                next_href = next_link['href']
                url = urljoin(url, next_href)
            else:
                url = None
        return quotes

    if __name__ == "__main__":
        start_url = "https://quotes.toscrape.com/"
        all_quotes = scrape_all_quotes(start_url)
        print(f"\nScraped {len(all_quotes)} quotes in total.\n")
        # Save to a JSON file
        output_file = "quotes.json"
        with open(output_file, "w", encoding="utf-8") as f:
            json.dump(all_quotes, f, ensure_ascii=False, indent=2)
        print(f"Data saved to file: {output_file}")
    

    Sample output:

    Scraping page: https://quotes.toscrape.com/
    Scraping page: https://quotes.toscrape.com/page/2/
    Scraping page: https://quotes.toscrape.com/page/3/
    Scraping page: https://quotes.toscrape.com/page/4/
    Scraping page: https://quotes.toscrape.com/page/5/
    Scraping page: https://quotes.toscrape.com/page/6/
    Scraping page: https://quotes.toscrape.com/page/7/
    Scraping page: https://quotes.toscrape.com/page/8/
    Scraping page: https://quotes.toscrape.com/page/9/
    Scraping page: https://quotes.toscrape.com/page/10/
    Scraped 100 quotes in total.
    Data saved to file: quotes.json
    

    quotes.json contents:

    [
      {
        "quote": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
        "author": "Albert Einstein",
        "tags": ["change", "deep-thoughts", "thinking", "world"]
      },
      {
        "quote": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
        "author": "J.K. Rowling",
        "tags": ["abilities", "choices"]
      },
      {
        "quote": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
        "author": "Albert Einstein",
        "tags": ["inspirational", "life", "live", "miracle", "miracles"]
      },
      ... (95 entries omitted)
      {
        "quote": "“A person's a person, no matter how small.”",
        "author": "Dr. Seuss",
        "tags": ["inspirational"]
      },
      {
        "quote": "“... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”",
        "author": "George R.R. Martin",
        "tags": ["books", "mind"]
      }
    ]
    

📎 Tips

  • urljoin(base_url, relative_path) automatically resolves a relative path against a base URL to produce an absolute URL

  • Some sites paginate with JavaScript instead of plain links; those require Selenium or Playwright (covered later)
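The urljoin behavior mentioned above is worth seeing concretely, since the resolution rules differ by path type (stdlib-only sketch; the example URLs are from this site):

```python
from urllib.parse import urljoin

base = "https://quotes.toscrape.com/page/2/"

# An absolute path ("/...") replaces the base URL's entire path
print(urljoin(base, "/page/3/"))   # https://quotes.toscrape.com/page/3/

# A relative path is resolved against the base URL's directory
print(urljoin(base, "tag/life/"))  # https://quotes.toscrape.com/page/2/tag/life/

# An already-absolute URL is returned unchanged
print(urljoin(base, "https://example.com/x"))  # https://example.com/x
```

Because quotes.toscrape.com emits absolute paths like /page/2/, the crawler can pass the current page URL as the base and always get a correct result.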

📝 Summary

  • Learned how to extract the "Next" link from a page
  • Implemented the logic for automatic page-by-page scraping
  • One step closer to building a complete data-collection tool
