Crawler Pitfall: ConnectionResetError(10054)

Web Scraping
Author

Tom

Published

January 27, 2023

While scraping announcement data from cninfo (巨潮资讯), I ran into the following error:

('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

1 Original Code

import logging
import time

import pandas as pd
import requests
from tqdm import tqdm
def get_json(pagenum):
    url = 'http://www.cninfo.com.cn/new/fulltextSearch/full?'
    
    headers = {
        "Accept-Encoding": "gzip",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    }
    
    payload = {
        'searchkey': '区块链',
        'sdate': '2009-01-01',
        'edate': '2022-12-31',
        'isfulltext': 'true',
        'sortName': 'pubdate',
        'sortType': 'asc',
        'type': 'shj',
    }
    
    payload['pageNum'] = str(pagenum)
    
    try:
        res = requests.get(url, headers=headers, params=payload)
#         res.close()
    except Exception as e:
        logging.warning(e)
        
    return res.json()
# View the first page of data
pd.DataFrame(get_json(1)['announcements']).head()
id secCode secName orgId announcementId announcementTitle announcementTime adjunctUrl adjunctSize adjunctType storageTime columnId pageColumn announcementType associateAnnouncement important batchNum announcementContent orgName announcementTypeName
0 None 000631 顺发恒业 gssz0000631 49624621 *ST 兰宝:重大资产出售、发行股份购买资产暨关联交易报告书(修订稿) 1235773830000 finalpage/2009-02-28/49624621.PDF 1749 PDF None 01030101||010612 SZZB 01010501||01010701||010112||01170110 None None None (二)公司业绩持续下滑,且无改善迹象 由于个别银行发现公司资金<em>链</em>紧张,逐步... None None
1 None 600596 新安股份 gssh0600596 49988949 新安股份:2008年年度报告 1236637800000 finalpage/2009-03-10/49988949.PDF 1711 PDF None 01030103||01030404||010612||010613 SHZB 01010501||010113||01030101 None None None 3.新安包装公司搬迁事项 因新安江桥东<em>区块</em>旧城改造,新安包装公司于本期从老... None None
2 None 600267 海正药业 gssh0600267 50367093 海正药业:2008年年度股东大会会议资料 1237501800000 finalpage/2009-03-20/50367093.PDF 346 PDF None 01030103||010612 SHZB 01010501||010113||011906 None None None 100.00% 141,278 100.00% 115,132 100.00% 公司凭借原料... None None
3 None 002244 滨江集团 9900004730 50327309 滨江集团:2008年年度报告 1237501800000 finalpage/2009-03-20/50327309.PDF 867 PDF None 01010302||01010306||01010410||01010411||010301... SZZB 01010503||010112||010114||01030101 None None None 江干科技经济园区地块开发协议书》(江科园协字【2006】035 号),约定就“S08、09、... None None
4 None 000301 东方盛虹 gssz0000301 50432137 东方市场:2008年年度报告 1237847400000 finalpage/2009-03-24/50432137.PDF 369 PDF None 01030101||01030402||010612||010613 SZZB 01010501||010112||01030101 None None None 公司将采取措施进一步完善产业<em>链</em>,继续减少关联交易。 None None
# Get the column names
column = pd.DataFrame(get_json(1)['announcements']).columns
# Get one page of data as a DataFrame
def get_df_data(pagenum):
    announcements_list = get_json(pagenum)['announcements']
    return pd.DataFrame(announcements_list)
def all_data(pagenum, res):
    df_data = get_df_data(pagenum)
    if len(df_data) > 0:
        res.extend(df_data.values)
res = []
for page in tqdm(range(1, 500)):
    all_data(page, res)
 24%|████████████████████▎                                                                | 119/499 [00:26<01:34,  4.03it/s]WARNING:root:('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
 24%|████████████████████▎                                                                | 119/499 [00:36<01:56,  3.28it/s]
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_20364/2801816917.py in <module>
      1 res = []
      2 for page in tqdm(range(1, 500)):
----> 3     all_data(page, res)

~\AppData\Local\Temp/ipykernel_20364/1866910109.py in all_data(pagenum, res)
      1 def all_data(pagenum, res):
----> 2     df_data = get_df_data(pagenum)
      3     if len(df_data) > 0:
      4         res.extend(df_data.values)

~\AppData\Local\Temp/ipykernel_20364/1464953047.py in get_df_data(pagenum)
      1 # Get one page of data as a DataFrame
      2 def get_df_data(pagenum):
----> 3     announcements_list = get_json(pagenum)['announcements']
      4     return pd.DataFrame(announcements_list)

~\AppData\Local\Temp/ipykernel_20364/565530397.py in get_json(pagenum)
     26         logging.warning(e)
     27 
---> 28     return res.json()

UnboundLocalError: local variable 'res' referenced before assignment

According to the error message, the connection was forcibly closed by the remote host. Inside get_json, requests.get failed to obtain a response, so the local variable res was never assigned and return res.json() raised an UnboundLocalError.
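
Incidentally, the UnboundLocalError itself is easy to avoid by not swallowing the exception inside get_json. A minimal sketch of that fix (it only surfaces the real network error; it does not address the connection resets themselves):

    try:
        res = requests.get(url, headers=headers, params=payload)
    except requests.RequestException as e:
        logging.warning(e)
        raise  # without a response there is nothing to .json(), so re-raise
    return res.json()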

2 Fault Analysis

A quick Google search turned up two explanations:

This can be caused by the two sides of the connection disagreeing over whether the connection timed out or not during a keepalive. (Your code tries to reuse the connection just as the server is closing it because it has been idle for too long.) You should basically just retry the operation over a new connection. (I’m surprised your library doesn’t do this automatically.)

That is, the code tries to reuse an old TCP connection, but the server considers it idle for too long and has already closed it. To check this explanation, I looked at the requests documentation: every requests.get call goes through the requests.request('GET', url, **kwargs) method, which is defined as follows:

def request(method, url, **kwargs):
    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)

So every get request is in fact a short-lived connection: requests.request opens the session in a with statement, which closes it automatically. To get a persistent connection with requests, you must use a Session object directly. And since each call to get closes its own connection, the server/client timeout disagreement described above cannot be the problem here.
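
As an aside: if you do want the retry-over-a-new-connection behavior that the answer above expects the library to provide, requests supports it through urllib3's Retry class mounted on a Session. A minimal sketch, where the retry count, backoff_factor, and status list are arbitrary choices of mine rather than anything from the original post:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry failed requests over a fresh connection, with exponential backoff.
retry = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))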

Some answers suggest calling res.close(). But since requests.request already closes the session automatically via the with statement, in principle there should be no need to call res.close() explicitly. The official documentation says as much:

close(): Releases the connection back to the pool. Once this method has been called the underlying raw object must not be accessed again. Note: Should not normally need to be called explicitly.

The web server actively rejected your connection. That’s usually because it is congested, has rate limiting or thinks that you are launching a denial of service attack. If you get this from a server, you should sleep a bit before trying again. In fact, if you don’t sleep before retry, you are a denial of service attack. The polite thing to do is implement a progressive sleep of, say, (1,2,4,8,16,32) seconds.

In other words, the requests were so frequent that the server flagged them as abusive and forcibly closed the connection. I tried adding time.sleep(random.random()*2) after each request; that helped, but after a while the connection was still dropped.
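
The progressive sleep suggested above is straightforward to implement. A sketch, assuming a hypothetical wrapper retry_get and a cap of six attempts (neither is from the original post):

import logging
import time

import requests

def retry_get(url, max_tries=6, **kwargs):
    for attempt in range(max_tries):
        try:
            return requests.get(url, **kwargs)
        except requests.exceptions.ConnectionError as e:
            logging.warning(e)
            if attempt == max_tries - 1:
                raise  # give up after the final attempt
            time.sleep(2 ** attempt)  # sleep 1, 2, 4, 8, ... seconds before retrying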

3 Solution

Based on the analysis above, the requests were being flagged by the server as abusive. The fix is to use a Session object: reusing a single TCP connection avoids the overhead of opening and closing many connections (response latency, CPU usage, network congestion), and it also makes the scraper faster.

def get_json(pagenum, s):
    url = 'http://www.cninfo.com.cn/new/fulltextSearch/full?'
    
    headers = {
        "Accept-Encoding": "gzip",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        "Connection": "keep-alive",
    }
    
    payload = {
        'searchkey': '区块链',
        'sdate': '2009-01-01',
        'edate': '2022-12-31',
        'isfulltext': 'true',
        'sortName': 'pubdate',
        'sortType': 'asc',
        'type': 'shj',
    }
    
    payload['pageNum'] = str(pagenum)
    
    try:
        res = s.get(url, headers=headers, params=payload)
    except Exception as e:
        logging.warning(e)
        raise  # re-raise: a failed request must not fall through to res.json()
        
    return res.json()
def get_df_data(pagenum, s):
    announcements_list = get_json(pagenum, s)['announcements']
    return pd.DataFrame(announcements_list)
def all_data(pagenum, s, res):
    df_data = get_df_data(pagenum, s)
    if len(df_data) > 0:
        res.extend(df_data.values)
res = []
with requests.Session() as s:
    for page in tqdm(range(1, 500)):
        all_data(page, s, res)
100%|█████████████████████████████████████████████████████████████████████████████████████| 499/499 [01:27<00:00,  5.71it/s]
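
The collected rows can then be assembled into one DataFrame using the column names captured earlier. A usage sketch (not part of the original run):

df = pd.DataFrame(res, columns=column)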

References

  • Http——Keep-Alive 机制 (the HTTP Keep-Alive mechanism)
  • 图解 HTTP 2.7:持久连接节省通信量 (Illustrated HTTP, §2.7: persistent connections save traffic)
  • https://stackoverflow.com/questions/27333671/how-to-solve-the-10054-error
  • https://stackoverflow.com/questions/8814802/python-errno-10054-an-existing-connection-was-forcibly-closed-by-the-remote-h