Fixing failed logins and status code 999 when scraping LinkedIn with Scrapy
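LinkedIn's 999 status code is a typical sign that its anti-scraping protection was triggered: your requests were identified as non-human. Based on your code, here are several targeted fixes: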
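1. Fix the login flow (the core problem)
Your current login logic does not work: it never submits the login form. It grabs the wrong token and then goes straight to the target page, so LinkedIn never receives your credentials. The correct flow is to fetch the real CSRF token from the login page first, then submit the username and password to complete the login.
Modify your start_requests and login-related methods: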
    # These methods go inside your spider class (the module needs `import scrapy`).
    def start_requests(self):
        self.username = self.settings['myemail']
        self.password = self.settings['password']
        # Request the login page first to obtain the real CSRF token
        yield scrapy.Request(
            url=self.login_page,
            callback=self.parse_login_page,
            dont_filter=True,
            headers=self.headers
        )

    def parse_login_page(self, response):
        # Extract the real CSRF token from the login page
        # (not lnkd-track-error, which is only used for error tracking)
        csrf_token = response.xpath("//input[@name='csrfToken']/@value").get()
        if not csrf_token:
            self.log("Failed to get valid CSRF token")
            return

        # Build the login form data
        form_data = {
            'session_key': self.username,
            'session_password': self.password,
            'csrfToken': csrf_token,
            'loginCsrfParam': csrf_token
        }

        # Submit the login request to the correct endpoint
        yield scrapy.FormRequest(
            url='https://www.linkedin.com/uas/login-submit',
            formdata=form_data,
            callback=self.check_login_response,
            headers=self.headers,
            dont_filter=True
        )
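To make the "fetch the hidden fields, then submit" step concrete, here is a minimal standard-library sketch that collects every hidden input from a login form and merges in the credentials. The sample markup, the token values, and the names `HiddenInputCollector` and `build_login_form` are made up for illustration; the real LinkedIn page may use different fields. Inside Scrapy, `FormRequest.from_response` performs a similar pre-fill of form fields for you.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_form(html, username, password):
    """Return form data: all hidden fields plus the credentials."""
    collector = HiddenInputCollector()
    collector.feed(html)
    form = dict(collector.fields)
    form["session_key"] = username
    form["session_password"] = password
    return form


# Hypothetical markup, only loosely modelled on a login form.
SAMPLE_HTML = """
<form action="/uas/login-submit" method="post">
  <input type="hidden" name="csrfToken" value="ajax:123">
  <input type="hidden" name="loginCsrfParam" value="abc-456">
  <input type="text" name="session_key">
  <input type="password" name="session_password">
</form>
"""

form = build_login_form(SAMPLE_HTML, "user@example.com", "secret")
```

Collecting all hidden inputs, rather than one hard-coded field, means the submission keeps working if the page carries separate values for `csrfToken` and `loginCsrfParam`.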
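2. Improve the request headers and User-Agent strategy
Your UA is a fixed, outdated Chrome version, which is easy to flag. Suggestions:
- Use a random User-Agent; the scrapy-useragents library can do this (install it, then configure it in settings)
- Send the full set of header fields a modern browser sends, so the requests look more realistic
Update your request headers and configure the random-UA middleware: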
    # Updated request headers
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        'upgrade-insecure-requests': '1',
        # 'br' needs Brotli support (e.g. the brotli package) in your Scrapy environment
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "en-US,en;q=0.9",
        "sec-ch-ua": '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Linux"',
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "none",
        "sec-fetch-user": "?1",
        'referer': 'https://www.linkedin.com/',
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    }

    # Configure the random-UA middleware in settings.py (install scrapy-useragents first)
    # DOWNLOADER_MIDDLEWARES = {
    #     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    #     'scrapy_useragents.downloadermiddlewares.useragents.UserAgentsMiddleware': 500,
    # }
    # USER_AGENTS = [
    #     'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    #     'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
    #     'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
    # ]
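If you would rather not add a dependency, the rotation idea itself is small. The following is a rough sketch, not what scrapy-useragents does internally: it picks one User-Agent per call from a fixed pool and merges it into a copy of the shared headers. The names `BASE_HEADERS` and `build_headers` are hypothetical.

```python
import random

# Pool of User-Agent strings to rotate through (taken from the settings above).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

# Headers shared by every request; the User-Agent is added per request.
BASE_HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "upgrade-insecure-requests": "1",
}


def build_headers(rng=random):
    """Return a fresh header dict with a randomly chosen User-Agent."""
    headers = dict(BASE_HEADERS)  # copy so the shared dict is never mutated
    headers["User-Agent"] = rng.choice(USER_AGENTS)
    return headers


# A seeded generator keeps the choice reproducible while debugging.
headers = build_headers(random.Random(0))
```

One caveat: client-hint headers such as sec-ch-ua and sec-ch-ua-platform name a specific browser version and platform, so a rotated UA that disagrees with them may look inconsistent in its own right. It is probably safer to rotate them together or leave them out.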
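3. Handle the cookie session correctly
Do not set JSESSIONID by hand. Let Scrapy's built-in CookiesMiddleware manage the session automatically. It is enabled by default; just confirm that your settings have not disabled it: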
    # Confirm in settings.py
    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
        # other middleware entries...
    }
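To check that the session cookies really are being stored and sent back, Scrapy's cookie settings can help. A minimal settings.py fragment, assuming you have not replaced the default cookies middleware:

```python
# settings.py
# Keep cookie handling on (this is Scrapy's default).
COOKIES_ENABLED = True
# Log every Cookie header sent and every Set-Cookie header received,
# which makes it easy to see whether the login session is preserved.
COOKIES_DEBUG = True
```

Turn COOKIES_DEBUG off again once the login works, since it makes the log very noisy.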
4. Add request delays and rate limiting
LinkedIn is extremely sensitive to request frequency; throttling lowers the chance of being flagged:
    # Add to settings.py
    DOWNLOAD_DELAY = 3                  # wait 3 seconds between requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 2  # concurrent requests per domain
    AUTOTHROTTLE_ENABLED = True         # enable automatic, dynamic throttling
    AUTOTHROTTLE_START_DELAY = 1
    AUTOTHROTTLE_MAX_DELAY = 5
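To get a feel for how these numbers interact, here is a simplified model of the delay update that Scrapy's AutoThrottle documentation describes: the new delay moves toward latency divided by the target concurrency, is clamped between DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY, and is not allowed to shrink after a non-200 response. The function `next_delay` is a hypothetical illustration, not Scrapy's actual code, and it assumes the default target concurrency of 1.0.

```python
def next_delay(prev_delay, latency, status, *,
               target_concurrency=1.0, min_delay=3.0, max_delay=5.0):
    """Simplified model of one AutoThrottle delay adjustment.

    min_delay plays the role of DOWNLOAD_DELAY and max_delay the role of
    AUTOTHROTTLE_MAX_DELAY from the settings above.
    """
    target = latency / target_concurrency
    # Move halfway toward the target, but never below the target itself.
    new_delay = max(target, (prev_delay + target) / 2.0)
    # Clamp into the configured range.
    new_delay = min(max(min_delay, new_delay), max_delay)
    # Error responses (such as 999) must not speed the crawl up.
    if status != 200 and new_delay <= prev_delay:
        return prev_delay
    return new_delay


fast_ok = next_delay(3.0, latency=0.5, status=200)      # fast server, stays at the floor
slow_ok = next_delay(3.0, latency=8.0, status=200)      # slow server, capped at the ceiling
after_block = next_delay(5.0, latency=0.2, status=999)  # blocked response, delay is kept
```

With the values above, the effective delay stays between 3 and 5 seconds, because DOWNLOAD_DELAY acts as the lower bound even though AUTOTHROTTLE_START_DELAY is 1.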
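5. A more reliable way to verify the login
Change check_login_response so that it judges success by where the login redirects to, rather than by checking whether the username appears in the response body: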
    def check_login_response(self, response):
        # After a successful login you are redirected to the home or feed page
        # instead of staying on the login page.
        # Match '/in/' rather than 'in/' so that URLs containing 'login/' do not count.
        if response.url != self.login_page and ('/feed' in response.url or '/in/' in response.url):
            self.log("Successfully logged in")
            yield scrapy.Request(url=self.start_urls[0], dont_filter=True, headers=self.headers)
        else:
            self.log("Login failed - check credentials or anti-bot measures")
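Substring checks on the full URL can misfire when a query string or an unrelated path happens to contain the same text. A slightly stricter variant parses the URL and looks only at the path. This is a sketch under assumptions: the helper `looks_logged_in` and both prefix lists are hypothetical, and treating `/checkpoint` as a verification challenge page is a guess about LinkedIn's URL layout rather than documented behaviour.

```python
from urllib.parse import urlparse

# Path prefixes assumed to mean "logged in" (feed or a profile page).
LOGGED_IN_PREFIXES = ("/feed", "/in/")
# Path prefixes assumed to mean "still at login or a verification challenge".
BLOCKED_PREFIXES = ("/login", "/uas/login", "/checkpoint")


def looks_logged_in(url):
    """Return True if the final URL's path looks like a post-login page."""
    path = urlparse(url).path
    if path.startswith(BLOCKED_PREFIXES):
        return False
    return path.startswith(LOGGED_IN_PREFIXES)


feed_ok = looks_logged_in("https://www.linkedin.com/feed/")
profile_ok = looks_logged_in("https://www.linkedin.com/in/someone/")
still_login = looks_logged_in("https://www.linkedin.com/login/?fromSignIn=true")
submit_page = looks_logged_in("https://www.linkedin.com/uas/login-submit")
```

Inside the spider, the condition in check_login_response could then become `if looks_logged_in(response.url):`.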
The question comes from Stack Exchange; it was asked by Softdev.




