
Web scraping: a status code problem

igaoyuan posted on 2023-01-26 18:43, 804 views
I'm learning web scraping. Tianyancha (天眼查, https://www.) is a fairly difficult site to scrape (a similar one is https://aiqicha.baidu.com/).

1. Enter the keyword “华为” (Huawei) and capture the request URL: https://www. The status code is shown as 200.


2. Code
Program code:
CLEAR
lcWb = '华为'    &&keywords   
lcWb1 = STRCONV(STRCONV(lcWb, 9), 15)
* Percent-encode the UTF-8 bytes (lcWb1 holds them as hex digits)
lcUTF8 = ""
FOR ln = 1 TO LEN(lcWb1) STEP 2
    lcUTF8 = lcUTF8 + "%" + SUBSTR(lcWb1, ln, 2)
ENDFOR
myurl = 'https://www.'  &&"https://aiqicha.baidu.com/s?q=&lcUTF8"

oHTTP = CREATEOBJECT("MSXML2.ServerXMLHTTP")
oHTTP.Open("GET", myurl, .F.)
OHTTP.SETREQUESTHEADER("Content-Type", "application/x-www-form-urlencoded")
lcSend = "erectDate=&nothing=&pjname=" + lcUTF8 + "&head=head_620.js&bottom=bottom_591.js"

oHTTP.Send(lcSend)
? oHTTP.Status
IF oHTTP.Status = 200
    lcStr = oHTTP.ResponseText                           && store the page content in lcStr
    STRTOFILE(lcStr,'D:\ex.txt')                               && debug: save the downloaded page as D:\ex.txt
ENDIF


3. The status code actually returned is 418.

4. After a search, the site automatically appends a changing code, &sessionNo=1674728807.71143526, to the URL. Is that related?
https://www.
5 replies
#2
sdta 2023-01-26 19:14
Try this:
Program code:
CLEAR
lcWb = '华为'    &&keywords   
lcWb1 = STRCONV(STRCONV(lcWb, 9), 15)
* Percent-encode the UTF-8 bytes (lcWb1 holds them as hex digits)
lcUTF8 = ""
FOR ln = 1 TO LEN(lcWb1) STEP 2
    lcUTF8 = lcUTF8 + "%" + SUBSTR(lcWb1, ln, 2)
ENDFOR
myurl = 'https://www.' + lcUTF8  &&"https://aiqicha.baidu.com/s?q=&lcUTF8"

oHTTP = CREATEOBJECT("MSXML2.XMLHTTP")
oHTTP.Open("GET", myurl, .F.)
*OHTTP.SETREQUESTHEADER("Content-Type", "application/x-www-form-urlencoded")
*lcSend = "erectDate=&nothing=&pjname=" + lcUTF8 + "&head=head_620.js&bottom=bottom_591.js"

oHTTP.Send()
? oHTTP.Status
IF oHTTP.Status = 200
    lcStr = oHTTP.ResponseText                           && store the page content in lcStr
    STRTOFILE(lcStr,'D:\ex.txt')     && debug: save the downloaded page as D:\ex.txt
    MODIFY FILE D:\ex.txt
ENDIF
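
The encoding loop that appears in both listings can be wrapped in a small function. This is only a minimal sketch that reuses the same two STRCONV() calls; the function name UrlEncodeUtf8 is hypothetical and does not come from the thread. Note that it percent-encodes every byte, including plain ASCII letters, which is still a valid URL encoding.

```foxpro
* Usage: print the percent-encoded UTF-8 form of the keyword.
? UrlEncodeUtf8("华为")

* Hypothetical helper (not from the thread): percent-encode every
* UTF-8 byte of tcText, exactly as the inline loop does.
FUNCTION UrlEncodeUtf8(tcText)
    LOCAL lcHex, lcOut, lnPos
    * STRCONV(..., 9) converts the ANSI/DBCS string to UTF-8;
    * STRCONV(..., 15) turns those bytes into pairs of hex digits.
    lcHex = STRCONV(STRCONV(tcText, 9), 15)
    lcOut = ""
    FOR lnPos = 1 TO LEN(lcHex) STEP 2
        lcOut = lcOut + "%" + SUBSTR(lcHex, lnPos, 2)
    ENDFOR
    RETURN lcOut
ENDFUNC
```

With the helper in place, the URL can be built as the base address plus UrlEncodeUtf8(lcWb), which keeps the encoding step in one spot.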
#3
igaoyuan 2023-01-26 19:57

Impressive! It really works!

One question:
myurla = 'https://www.' + lcUTF8
myurlb = 'https://www.'
?myurla
?myurlb
The two strings print identically, yet the results differ. Why?
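
Two strings that look the same when printed with ? are not necessarily byte-for-byte identical, so it may be worth checking that first. A small sketch of such a check follows; the URLs are hypothetical placeholders (example.com), to be replaced by the two real strings.

```foxpro
* Placeholder URLs, not the ones from the thread.
myurla = "https://example.com/search?key=" + "%E5%8D%8E"
myurlb = "https://example.com/search?key=%E5%8D%8E"

? myurla == myurlb            && exact comparison, independent of SET EXACT
? LEN(myurla), LEN(myurlb)    && lengths in bytes
? STRCONV(myurla, 15)         && hex dump of the first string
? STRCONV(myurlb, 15)         && hex dump of the second string
```

If the == comparison prints .T., the URL strings really are the same, and the difference in status code would have to come from something else in the request, such as the HTTP object used, the headers, or the request body.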
#4
吹水佬 2023-01-26 20:08
It may be a cookie issue.
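
One way to look into the cookie theory is to print what each HTTP object actually receives. The working version in #2 changed more than the URL: it switched from MSXML2.ServerXMLHTTP to MSXML2.XMLHTTP and dropped the request body and Content-Type header. To my knowledge, MSXML2.XMLHTTP is built on WinINet and shares cookies and cache with Internet Explorer, while MSXML2.ServerXMLHTTP is built on WinHTTP and does not, which would fit the cookie explanation, though the thread does not confirm it. In the sketch below the URL is a placeholder and the procedure name ShowResponse is hypothetical.

```foxpro
lcUrl = "https://example.com/"   && placeholder, substitute the real request URL

DO ShowResponse WITH "MSXML2.ServerXMLHTTP", lcUrl
DO ShowResponse WITH "MSXML2.XMLHTTP", lcUrl

* Hypothetical helper: send a plain GET with the given COM class
* and print the status line and the response headers.
PROCEDURE ShowResponse(tcClass, tcUrl)
    LOCAL loHttp
    loHttp = CREATEOBJECT(tcClass)
    loHttp.Open("GET", tcUrl, .F.)   && synchronous request
    loHttp.Send()                    && GET carries no body
    ? tcClass, loHttp.Status, loHttp.StatusText
    ? loHttp.getAllResponseHeaders()
ENDPROC
```

Comparing the two outputs for the same URL shows whether the server treats the two clients differently before any keyword encoding comes into play.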
#5
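igaoyuan 2023-01-26 20:34
If the search keyword is numeric or English (for example, searching for huawei), both approaches work. Chinese may be a special case, even though the strings are identical...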
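
A possible reason ASCII keywords behave differently, offered as an assumption rather than a confirmed diagnosis: digits and English letters have the same bytes in the ANSI code page and in UTF-8, whereas Chinese characters do not (on code page 936, 华为 is stored as GBK bytes, while its UTF-8 form is E5 8D 8E E4 B8 BA). The sketch below compares the two byte sequences for each kind of keyword.

```foxpro
lcAscii   = "huawei"
lcChinese = "华为"

* Compare the raw bytes with the UTF-8 bytes, both as hex strings.
* For the ASCII keyword this should print .T.
? STRCONV(lcAscii, 15) == STRCONV(STRCONV(lcAscii, 9), 15)
* For the Chinese keyword on a code page 936 system this should print .F.
? STRCONV(lcChinese, 15) == STRCONV(STRCONV(lcChinese, 9), 15)
```

So a raw Chinese keyword placed directly in a URL would reach the server as different bytes than its percent-encoded UTF-8 form, while for huawei the two forms mean the same thing.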
#6
igaoyuan 2023-01-26 21:36
Reply to #2 sdta
Thank you! Thank you!!