```python
import logging
import random
import time

import pandas as pd
import requests
from tqdm import tqdm
```
While scraping announcement data from cninfo (巨潮资讯), I ran into this error:

```
('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
```
1 Original code
```python
def get_json(pagenum):
    url = 'http://www.cninfo.com.cn/new/fulltextSearch/full?'
    headers = {
        "Accept-Encoding": "gzip",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    }
    payload = {
        'searchkey': '区块链',
        'sdate': '2009-01-01',
        'edate': '2022-12-31',
        'isfulltext': 'true',
        'sortName': 'pubdate',
        'sortType': 'asc',
        'type': 'shj',
    }
    payload['pageNum'] = str(pagenum)
    try:
        res = requests.get(url, headers=headers, params=payload)
        # res.close()
    except Exception as e:
        logging.warning(e)
    return res.json()
```
```python
# Preview the first page of results
pd.DataFrame(get_json(1)['announcements']).head()
```
id | secCode | secName | orgId | announcementId | announcementTitle | announcementTime | adjunctUrl | adjunctSize | adjunctType | storageTime | columnId | pageColumn | announcementType | associateAnnouncement | important | batchNum | announcementContent | orgName | announcementTypeName | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | None | 000631 | 顺发恒业 | gssz0000631 | 49624621 | *ST 兰宝:重大资产出售、发行股份购买资产暨关联交易报告书(修订稿) | 1235773830000 | finalpage/2009-02-28/49624621.PDF | 1749 | None | 01030101||010612 | SZZB | 01010501||01010701||010112||01170110 | None | None | None | (二)公司业绩持续下滑,且无改善迹象 由于个别银行发现公司资金<em>链</em>紧张,逐步... | None | None | |
1 | None | 600596 | 新安股份 | gssh0600596 | 49988949 | 新安股份:2008年年度报告 | 1236637800000 | finalpage/2009-03-10/49988949.PDF | 1711 | None | 01030103||01030404||010612||010613 | SHZB | 01010501||010113||01030101 | None | None | None | 3.新安包装公司搬迁事项 因新安江桥东<em>区块</em>旧城改造,新安包装公司于本期从老... | None | None | |
2 | None | 600267 | 海正药业 | gssh0600267 | 50367093 | 海正药业:2008年年度股东大会会议资料 | 1237501800000 | finalpage/2009-03-20/50367093.PDF | 346 | None | 01030103||010612 | SHZB | 01010501||010113||011906 | None | None | None | 100.00% 141,278 100.00% 115,132 100.00% 公司凭借原料... | None | None | |
3 | None | 002244 | 滨江集团 | 9900004730 | 50327309 | 滨江集团:2008年年度报告 | 1237501800000 | finalpage/2009-03-20/50327309.PDF | 867 | None | 01010302||01010306||01010410||01010411||010301... | SZZB | 01010503||010112||010114||01030101 | None | None | None | 江干科技经济园区地块开发协议书》(江科园协字【2006】035 号),约定就“S08、09、... | None | None | |
4 | None | 000301 | 东方盛虹 | gssz0000301 | 50432137 | 东方市场:2008年年度报告 | 1237847400000 | finalpage/2009-03-24/50432137.PDF | 369 | None | 01030101||01030402||010612||010613 | SZZB | 01010501||010112||01030101 | None | None | None | 公司将采取措施进一步完善产业<em>链</em>,继续减少关联交易。 | None | None |
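One note on reading this table: `announcementTime` is a Unix timestamp in milliseconds. A quick sanity check with pandas, using only the first value shown above (1235773830000, whose `adjunctUrl` is dated 2009-02-28):

```python
import pandas as pd

# announcementTime is epoch milliseconds (UTC); converting to Beijing time
# matches the 2009-02-28 date embedded in the first row's adjunctUrl.
ts = pd.to_datetime(1235773830000, unit='ms', utc=True).tz_convert('Asia/Shanghai')
print(ts)  # 2009-02-28 06:30:30+08:00
```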
```python
# Get the column names
column = pd.DataFrame(get_json(1)['announcements']).columns

# Fetch one page of data as a DataFrame
def get_df_data(pagenum):
    announcements_list = get_json(pagenum)['announcements']
    return pd.DataFrame(announcements_list)

def all_data(pagenum, res):
    df_data = get_df_data(pagenum)
    if len(df_data) > 0:
        res.extend(df_data.values)

res = []
for page in tqdm(range(1, 500)):
    all_data(page, res)
```
```
 24%|████████████████████▎ | 119/499 [00:26<01:34, 4.03it/s]WARNING:root:('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))
 24%|████████████████████▎ | 119/499 [00:36<01:56, 3.28it/s]
---------------------------------------------------------------------------
UnboundLocalError                         Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_20364/2801816917.py in <module>
      1 res = []
      2 for page in tqdm(range(1, 500)):
----> 3     all_data(page, res)

~\AppData\Local\Temp/ipykernel_20364/1866910109.py in all_data(pagenum, res)
      1 def all_data(pagenum, res):
----> 2     df_data = get_df_data(pagenum)
      3     if len(df_data) > 0:
      4         res.extend(df_data.values)

~\AppData\Local\Temp/ipykernel_20364/1464953047.py in get_df_data(pagenum)
      1 # Fetch one page of data as a DataFrame
      2 def get_df_data(pagenum):
----> 3     announcements_list = get_json(pagenum)['announcements']
      4     return pd.DataFrame(announcements_list)

~\AppData\Local\Temp/ipykernel_20364/565530397.py in get_json(pagenum)
     26     logging.warning(e)
     27
---> 28     return res.json()

UnboundLocalError: local variable 'res' referenced before assignment
```
Per the error message, the remote host forcibly closed the connection. Inside `get_json`, `requests.get` never obtained a response, so `res` was never assigned and `return res.json()` raised the `UnboundLocalError`.
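Independent of the connection problem, one immediate hardening step is to keep `res.json()` inside the `try`, returning `None` on failure, so a network error can never reach an unassigned variable. A minimal sketch, with a hypothetical `fetch` callable standing in for the actual `requests.get` call:

```python
import logging

def get_json_safe(pagenum, fetch):
    # `fetch` stands in for requests.get; on failure we return None
    # instead of falling through to an unassigned `res`.
    try:
        res = fetch(pagenum)
        return res.json()  # only reached if fetch() succeeded
    except Exception as e:
        logging.warning(e)
        return None

# Simulated responses for illustration (no network needed)
class FakeResponse:
    def json(self):
        return {'announcements': []}

def boom(pagenum):
    raise ConnectionResetError(10054, 'forcibly closed')

assert get_json_safe(1, lambda n: FakeResponse()) == {'announcements': []}
assert get_json_safe(1, boom) is None  # no UnboundLocalError anymore
```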
2 Failure analysis
A quick Google search turned up two explanations:
This can be caused by the two sides of the connection disagreeing over whether the connection timed out or not during a keepalive. (Your code tries to reused the connection just as the server is closing it because it has been idle for too long.) You should basically just retry the operation over a new connection. (I’m surprised your library doesn’t do this automatically.)
That is, the code tried to reuse an old TCP connection that the server, considering it idle for too long, had already closed. Checking the requests documentation, though, every `requests.get` call goes through `requests.request('GET', url, **kwargs)`, which is defined as:

```python
def request(method, url, **kwargs):
    # By using the 'with' statement we are sure the session is closed, thus we
    # avoid leaving sockets open which can trigger a ResourceWarning in some
    # cases, and look like a memory leak in others.
    with sessions.Session() as session:
        return session.request(method=method, url=url, **kwargs)
```

So each `get` call is in fact a short-lived connection: `requests.request` wraps the call in a `with` statement that closes the `session` automatically. (To get persistent connections in requests, you use a `Session` object.) Since every `get` call closes its connection anyway, a server/client disagreement over a keep-alive timeout cannot be the problem here.
Some answers suggest calling `res.close()`. But since `requests.request` already closes the `session` via `with`, there should be no need to call `res.close()` explicitly, and the official documentation says as much:

`close()`: Releases the connection back to the pool. Once this method has been called the underlying raw object must not be accessed again. Note: Should not normally need to be called explicitly.
The web server actively rejected your connection. That’s usually because it is congested, has rate limiting or thinks that you are launching a denial of service attack. If you get this from a server, you should sleep a bit before trying again. In fact, if you don’t sleep before retry, you are a denial of service attack. The polite thing to do is implement a progressive sleep of, say, (1,2,4,8,16,32) seconds.
This second explanation fits: the requests were too frequent, so the server flagged them as abusive and forcibly dropped the connection. Adding `time.sleep(random.random() * 2)` after each request helped, but the connection still dropped after a while.
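The "progressive sleep" the second answer recommends can be wrapped around any fetch call. A sketch with hypothetical names (`with_backoff`, `fn`), doubling the delay after each failure:

```python
import logging
import time

def with_backoff(fn, retries=6, base_delay=1.0):
    # Progressive sleep as suggested above: 1, 2, 4, 8, 16, 32 seconds
    # (scaled by base_delay) between attempts; re-raise after the last one.
    for attempt in range(retries):
        try:
            return fn()
        except ConnectionError as e:
            logging.warning(e)
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```

Usage would look like `with_backoff(lambda: get_json(page))`; note that `ConnectionResetError` is a subclass of `ConnectionError`, so 10054 resets are caught.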
3 Solution
Given the analysis above, the server was treating the crawler as abusive. The fix is to use a `Session` object: reusing one TCP connection avoids the overhead of repeatedly opening and closing connections (latency, CPU, congestion at the server), and it also makes the crawler faster.
```python
def get_json(pagenum, s):
    url = 'http://www.cninfo.com.cn/new/fulltextSearch/full?'
    headers = {
        "Accept-Encoding": "gzip",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        "Connection": "keep-alive",
    }
    payload = {
        'searchkey': '区块链',
        'sdate': '2009-01-01',
        'edate': '2022-12-31',
        'isfulltext': 'true',
        'sortName': 'pubdate',
        'sortType': 'asc',
        'type': 'shj',
    }
    payload['pageNum'] = str(pagenum)
    try:
        res = s.get(url, headers=headers, params=payload)
    except Exception as e:
        logging.warning(e)
    return res.json()

def get_df_data(pagenum, s):
    announcements_list = get_json(pagenum, s)['announcements']
    return pd.DataFrame(announcements_list)

def all_data(pagenum, s, res):
    df_data = get_df_data(pagenum, s)
    if len(df_data) > 0:
        res.extend(df_data.values)

res = []
with requests.Session() as s:
    for page in tqdm(range(1, 500)):
        all_data(page, s, res)
```
```
100%|█████████████████████████████████████████████████████████████████████████████████████| 499/499 [01:27<00:00, 5.71it/s]
```
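As a complementary hardening step, requests can also retry at the transport layer: mounting an `HTTPAdapter` with a urllib3 `Retry` policy on the `Session` gives the same kind of progressive backoff without hand-rolled sleeps. A sketch (the retry counts and status codes are illustrative, tune them to taste):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 5 times with exponential backoff (0.5s, 1s, 2s, ...),
# also retrying on common throttling/server-error status codes.
retry = Retry(total=5, backoff_factor=0.5,
              status_forcelist=[429, 500, 502, 503, 504])
s = requests.Session()
s.mount('http://', HTTPAdapter(max_retries=retry))
s.mount('https://', HTTPAdapter(max_retries=retry))
# s.get(...) calls now transparently retry on connection resets
```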
References
- HTTP Keep-Alive mechanism
- 图解 HTTP, §2.7: persistent connections save traffic
- https://stackoverflow.com/questions/27333671/how-to-solve-the-10054-error
- https://stackoverflow.com/questions/8814802/python-errno-10054-an-existing-connection-was-forcibly-closed-by-the-remote-h