
A question about the Pool thread pool

Mungki, posted 2020-11-21 15:13, 1513 views
I wrote a scraper and added a pool. It runs without errors, but I noticed a problem (see the attached screenshot):

The chapters are scraped out of order. How can I fix this? Thanks.
Here is the code:
from multiprocessing import Pool
import re
from lxml import etree
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'
}

def get_html():
    url = "http://www."  # URL truncated in the original post
    response = requests.get(url, headers=headers)
    print('%s fetched successfully' % url)
    response.encoding = 'utf-8'
    response = etree.HTML(response.text)
    url_list = response.xpath('/html/body/div[7]/div/ul//li/span/a/@href')
    return url_list

def save_html(url):
    response = requests.get(url, headers=headers)
    response.encoding = 'utf-8'
    response = etree.HTML(response.text)
    title = response.xpath('//title/text()')[0].replace('?', '')
    # title = response.xpath('//div[@cLass="bg"]/h1/text()')[0]
    text_list = response.xpath('//div[@class="bg"]/div[@class="content"]//p/text()')
    with open('C:/Users/Administrator/Desktop/斗罗大陆/%s.txt' %(title), 'w', encoding='utf-8') as file:
        file.write(title + '\n')
        for text in text_list:
            file.write('\t%s\n' %(text))
    print('"%s" scraped successfully' % title)
 
if __name__ == "__main__":
    url_list = get_html()
    pool = Pool(4)
    pool.map(save_html, url_list)
2 replies
#2
fall_bernana, 2020-11-23 10:34
Replying to OP Mungki
This comes down to how fast each URL you scrape responds: some pages return quickly, some slowly, so the workers finish in a different order than they started. As long as every chapter is eventually saved, what is the problem? Note that `pool.map` still returns its results in input order; only the side effects (prints, file writes) happen out of order.
Code:

from multiprocessing.pool import Pool
import time

def hhh(i):
    # Even numbers sleep, so they finish after the odd ones.
    if i % 2 == 0:
        time.sleep(2)
    print(i, i % 2, i * 2)


if __name__ == '__main__':
    pool = Pool(4)
    pool.map(hhh, [1, 2, 3, 4, 5, 6, 7, 8, 9])
---------------------------------
1 1 2
3 1 6
5 1 10
7 1 14
2 0 4
9 1 18
4 0 8
6 0 12
8 0 16
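The run above shows the prints arriving out of order. A minimal sketch (with a hypothetical `double` worker, not from the original post) showing that `pool.map` nevertheless returns its results in input order:

```python
from multiprocessing.pool import Pool
import time

def double(i):
    # Even inputs sleep briefly, so workers finish out of order.
    if i % 2 == 0:
        time.sleep(0.2)
    return i * 2

if __name__ == '__main__':
    with Pool(4) as pool:
        results = pool.map(double, [1, 2, 3, 4, 5, 6, 7, 8, 9])
    print(results)  # [2, 4, 6, 8, 10, 12, 14, 16, 18] -- input order preserved
```

So if the "out of order" problem is only about the printed messages or filenames, the returned list is already in the right order.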

#3
phiplato, 2020-12-26 22:02
Build a table of contents yourself with a link to each chapter; the contents can be sorted by chapter number.
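One way to get sortable chapters (a hypothetical sketch, no network calls, with helper names `numbered_jobs` and `chapter_filename` that are not from the thread): number each URL with its position in the original `url_list` before handing it to the pool, and put that number at the front of the saved filename, so the files sort in reading order no matter which worker finishes first.

```python
# Sketch: carry each chapter's index from the original url_list into the
# filename, so out-of-order completion no longer matters.
def numbered_jobs(url_list):
    # enumerate preserves the table-of-contents order
    return [(index, url) for index, url in enumerate(url_list, start=1)]

def chapter_filename(job):
    index, url = job
    # zero-padded prefix makes plain alphabetical sorting follow chapter order
    return '%04d_%s.txt' % (index, url.rsplit('/', 1)[-1])

jobs = numbered_jobs(['book/ch1', 'book/ch2', 'book/ch10'])
print([chapter_filename(j) for j in jobs])
# ['0001_ch1.txt', '0002_ch2.txt', '0003_ch10.txt']
```

With `pool.map(save_html, jobs)` each worker would unpack `(index, url)` and write to the prefixed filename.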