
序言

大学的终点是浦东,浦东的终点是陆家嘴。我越来越疑惑自己到底为何而来,又究竟将向何处去,想要有所不同的追求,但到头来还是流于俗套,回来第一时间去支付宝大厦找了份差事,我只知道很多认识的人都在这里,或许也该来这看一看。走进S空间时,一个从零开始的新阶段又开始了。

我渐渐开始每晚梦到 故事里的长安
长安城有人歌诗三百 歌尽了悲欢
抵达的时候阳光正好 听风吹得暖软
可我为什么忽然失措 在长安
这重重楼阁浩浩殿堂 都不是我想象
我心中曾有画卷一幅 画着它模样
长安城忽然开始下雨 湿了繁华沧桑
慌张人潮里我遗忘了 来时的方向
那年转身离去
水声远了河岸
村落是否依然
千万里外我怅然回看

这次之前没有准备,甚至没有计划要找个班上。但回来碰巧看到蚂蚁的一个岗位便投了下,结果从投递到录用只用了五天,可能是组里确实缺人,让我捡了漏。虽然日常并不善于言辞,但是每次面试时却特能扯,啥问题都能扯出个一二三来。如果不是人格分裂,那就只能是生存所迫。一面的谢专老师是个有些不修边幅的人,感觉在大厂里这样的人还挺少见的,二面的芳姐恰好是校友,看起来是一个比较容易相处的人。

其实我别无选择,只能妥协。我大概能猜到为什么AK毕业不到两年就辞了虹桥那边国企的工作回杨浦创业,然而,七月他又因为创业不顺被迫先回浦东找了份工作以作权宜。我问他是否会搬回浦东,现在他每天从仁德路往浦东赶要一个多钟头时间,晚上八点才能到家,他说不会,可能有许多理由,我也无从得知,但或许浦东就是一个围城,形势使然,每个人都在往里面挤,却也有很多人想走却走不掉。

所以一切又回到了上一篇的结尾,生活应赋予纯粹的兴致以周期性的空间和时间,人不是机械轮轴,生命也不是单目标的命题。我知道大厂有许多团建,但随着年龄的增长,越来越难培养长久的兴趣和结识令人信赖的朋友,有的人最近喜欢上了骑车,而对我来说没有什么能比在场地一同挥洒汗水,跑到上头更令人兴奋,我享受在场地上拉爆别人和被别人拉爆,可是或许我终将无法坚持下去。

你只有足够的努力,才能跑得很轻松。

可是更多的人,努力一辈子都无法跑得轻松。

尽管如此,他们依然装作很轻松。


文章目录

  • 序言
    • 20230822
    • 20230823~20230824
    • 20230825~20230829
    • 20230830~20230831
    • 20230901
    • 20230902~20230904
    • 20230905
    • 20230906
    • 20230907
    • 20230908~20230909
    • 20230910~20230911
    • 20230912~20230913
    • 20230914
    • 20230915~20230916
    • 20230917
    • 20230918~20230919
    • 20230920~20230921
    • 20230922~20230925
    • 20230926~20230928
    • 20230929~20231002
    • 20231003~20231005
    • 20231006~20231008
    • 20231009~20231010
    • 20231011~20231012


20230822

关于dgl里的如何根据边得到端点数据的问题。最近刚好有人提问,做一下回答记录:

有人可能看到消息传递的apply_func里会有edges.dst或者edges.src的写法,所以想当然地觉得可以用graph.edges[i].dst或者graph.edges[i].src来依次访问每条边的端点。前者的数据类型是dgl.udf.EdgeBatch,即一个batch的边,因为消息传递本身是模型前向传播的逻辑,在训练时输入通常是一个batch的边。后者graph.edges[i](同构图中也可以用graph.edges[u, v],异构图则用graph.edges[<name>])是一个edgespace。事实上它在dgl里只有三个属性:data, count, index,有用也只有data(其他两个一般用不到)。

问题是如何给定一个edgespace类型的变量,找到它的端点呢?这在dgl里似乎没有很方便的直接实现,在networkx中是可以用g.edges.data来访问所有边的端点及其数据,但是dgl中并没有g.edges.data这种用法,只能用g.edges[<edge_id>].data或者g.edges[u, v].data来访问一条边的数据。如果必须要用dgl实现的话,目前看下来只能使用graph.has_edge_between或者graph.has_edges_between函数判断两个点之间是否有边,然后依次遍历所有的边来实现。

另外graph.edge_ids(u, v)可以查询两个节点之间边的编号。
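
补充一个最小示例(基于较新版本dgl的接口,具体以所用版本的文档为准):edges()可以直接拿到所有边的两端节点,find_edges()可以按边编号反查端点,与上面的edge_ids()正好互逆。

import dgl
import torch

# 一个有4条边的小同构图: 0->1, 0->2, 1->2, 2->3
g = dgl.graph((torch.tensor([0, 0, 1, 2]), torch.tensor([1, 2, 2, 3])))

# 所有边的端点, 按边编号顺序返回两个张量
src, dst = g.edges()
print(src, dst)                            # tensor([0, 0, 1, 2]) tensor([1, 2, 2, 3])

# 按边编号反查端点
u, v = g.find_edges(torch.tensor([0, 2]))
print(u, v)                                # tensor([0, 1]) tensor([1, 2])

# 按端点查边编号
print(g.edge_ids(0, 2))                    # 1, 即0->2这条边的编号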


20230823~20230824

  • 昨天跟zsp接头,临行前他给我留了两个字,说AGR有些务虚,有种出乎意料又情理之中的感觉,有亿点点无奈。
  • 晚上129,课表(200m快+200m慢)×20,高级组快@44秒,慢@60秒;精英组快@40秒,慢@56秒(算起来精英组平均配速4’00",高级组4’20",这个配速跑20圈不算快,但变速比匀速要难许多)。结果他们跑着跑着就变成了节奏跑,高级组两个pacer最后按照3’55"跑了16km,我直接裂开。八点等到了AK,他下地铁直接来同济,他最近老板天天带他们去吃酒水烧烤,显然状态也虚得很,跟着补了4~5k,后程带到3’45"力不能逮,不过AK很快也跑崩,晚饭吃多要吐。
  • 回头路上跟AK了解了一下他现在的工作,他们商学院国际贸易专业出来的不少会去做跨境电商,从采购到销售,中间会有许多工作(如广告投放,图片设计,物流管理),他之前创业时就是做这个,然后现在去的也是一家创业公司,合伙人以前也是国企员工,积累一定资本,就出来自己做老板,有点像杉数,氛围总归比深不见底的大江湖好些。
  • 从家庭的角度来说,编制是最稳定的选择,但从个人价值实现的角度来看,应当尊重任何选择。即便是最头部的投行,近两三年也有极大规模的裁员,中年失业无疑是致命的。我的父母是高中同学,母亲高考落榜却稳定地从事三十多年的医生职业,慢慢也积累了一定资历,父亲虽华理出身,但前后跳槽三四次,如今有些不尽人意,这么多年也大吵过两次架,但是现在家庭非常和睦,这对我影响很大,我不是一个风险偏好的人,因此大概率只能妥协。
import gradio as gr


def sentence_builder(quantity, animal, countries, place, activity_list, morning):
    return f"""The {quantity} {animal}s from {" and ".join(countries)} went to the {place} where they {" and ".join(activity_list)} until the {"morning" if morning else "night"}"""


demo = gr.Interface(
    sentence_builder,
    [
        gr.Slider(2, 20, value=4, label="Count", info="Choose between 2 and 20"),
        gr.Dropdown(
            ["cat", "dog", "bird"], label="Animal", info="Will add more animals later!"
        ),
        gr.CheckboxGroup(["USA", "Japan", "Pakistan"], label="Countries", info="Where are they from?"),
        gr.Radio(["park", "zoo", "road"], label="Location", info="Where did they go?"),
        gr.Dropdown(
            ["ran", "swam", "ate", "slept"], value=["swam", "slept"], multiselect=True, label="Activity", info="Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed auctor, nisl eget ultricies aliquam, nunc nisl aliquet nunc, eget aliquam nisl nunc vel nisl."
        ),
        gr.Checkbox(label="Morning", info="Did they do it in the morning?"),
    ],
    "text",
    examples=[
        [2, "cat", ["Japan", "Pakistan"], "park", ["ate", "swam"], True],
        [4, "dog", ["Japan"], "zoo", ["ate", "swam"], False],
        [10, "bird", ["USA", "Pakistan"], "road", ["ran"], False],
        [8, "cat", ["Pakistan"], "zoo", ["ate"], True],
    ]
)

if __name__ == "__main__":
    demo.launch()

20230825~20230829

  • 诸事不利,搞到今天才把一些写了老久的东西整完,加上近期台风环伺,周六停跑,周日上午跟AK去游了个泳(又被打回了旱鸭子原形,不过还是勉强能游的,但明显不如七月初熟练),晚上便也懒得下来跑步,昨天觉得再不能偷懒,实验大楼干了5个上下,但还是不能扭转身体低迷的状态,来个完满的晴天或许就好了。

关于ONet数据库,写了一个爬虫把它所有的文件弄下来,后期可能要用到。大致看了一下,针对每一个细分的职业技能(大概有几百种),会对所有1016个职位进行评分和重要性衡量。1016种职位分成23种大类型,每种职位都有详细全面的介绍页(包括job duties, work context等),然后由此会衍生出一些相关性查询的引擎,数据内容很多,但是感觉上原始数据就是这些职位介绍,后面的一些查询都是DM的结果。

先挂一个比较稳定的O*Net Data板块的爬取脚本:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu

import os
import re
import time
import json
import random
import logging
import requests

from crawler.base import BaseCrawler
from bs4 import BeautifulSoup


class ONetCrawler(BaseCrawler):
	home_url = "https://www.onetonline.org"
	headers = """Host: www.onetonline.org
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Cookie: session3=Ci78F2TofJK2QcRsFFqoAg==
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1"""
	headers_dict = BaseCrawler.headers_to_dict(headers)
	
	def __init__(self):
		super(ONetCrawler, self).__init__()

	def run_skill(self):
		save_root = os.path.join("data", "onet")
		os.makedirs(save_root, exist_ok=True)
		title_results = self.extract_title_results(save_path="content.json")
		for title_result in title_results[1:]:
			title_id = title_result["id"]
			title_name = title_result["name"]
			title_url = title_result["url"]
			save_dir = os.path.join(save_root, title_name)
			os.makedirs(save_dir, exist_ok=True)
			skill_results = self.extract_skill_results(title_url, verbose=True)
			with open(os.path.join(save_dir, "meta.json"), 'w', encoding="utf8") as f:
				json.dump(skill_results, f, indent=4)
			for skill_result in skill_results:
				skill_id = skill_result["id"]
				skill_name = skill_result["name"]
				skill_url = skill_result["url"]
				if not skill_url.startswith('#'):
					download_url = self.home_url + skill_url
					self.download_excel_and_csv(download_url, save_dir)
	
	def download_excel_and_csv(self, download_url, save_dir=str()):
		response = self.easy_requests(method="GET", url=download_url, headers=self.headers_dict)
		html = response.text
		soup = BeautifulSoup(html, "lxml")
		h2_tag = soup.find("h2", class_="reportdesc")
		if h2_tag is None:
			h2_tag = soup.find("h2", class_="report")
			if h2_tag is None:
				a_tags = soup.find_all("a", class_="ms-2")
			else:
				a_tags = h2_tag.find_all("a", class_="ms-2")
		else:
			a_tags = h2_tag.find_all("a", class_="ms-2")
		assert len(a_tags) == 2, a_tags
		for a_tag in a_tags:
			href = a_tag.attrs["href"]
			filename = href.split('/')[-1].split('?')[0]
			if href.endswith("xlsx"):
				excel_url = self.home_url + href
				response = self.easy_requests(method="GET", url=excel_url, headers=self.headers_dict)
				with open(os.path.join(save_dir, filename), "wb") as f:
					f.write(response.content)
			elif href.endswith("csv"):
				csv_url = self.home_url + href
				response = self.easy_requests(method="GET", url=csv_url, headers=self.headers_dict)
				with open(os.path.join(save_dir, filename), "wb") as f:
					f.write(response.content)
			else:
				logging.warning(f"Unknown href format: {href}")

	# Extract skill URLs by each title URL
	def extract_skill_results(self, title_url, verbose=False):
		response = self.easy_requests(method="GET", url=title_url, headers=self.headers_dict)
		html = response.text
		soup = BeautifulSoup(html, "lxml")
		# Find top <div> tag by id
		div_tags = soup.find_all("div", id="cmtop")
		assert len(div_tags) == 1, f"Extract more than 1 top <div> tags on {title_url}"
		top_div_tag = div_tags[0]
		# Find top <ul> tag, which is usually the unique sibling of top <div> tag
		ul_tags = top_div_tag.find_all("ul", recursive=False)
		assert len(ul_tags) == 1, f"Extract more than 1 top <ul> tags on {title_url}"
		top_ul_tag = ul_tags[0]
		skill_results = list()
		if verbose:
			# Detailedly build the skill hierarchical tree
			# The hierarchical architecture is like below:
			# <div id="cmtop">
			#   <ul>
			#     <li><a>...</a><div>...</div>...</li>
			#     <li><a>...</a><div>...</div>...</li>
			#     <li><a>...</a><div>...</div>...</li>
			#   </ul>
			# </div>
			# 1. The <a> tag in <li> tag contains the name of the skill
			# 2. The <div> tag in <li> tag contains the description of the skill
			# 3. There may be recursive <div><ul><li>...</li></ul></div> block in the last ellipsis:
			#    - If there is no <div><ul><li>...</li></ul></div> block in ellipsis, then this <li> tag is the leaf node
			#    - Another method to distinguish leaf node is judge whether there is "cm-toggle" in the class of <a> tag
			#    - Another method to distinguish leaf node is judge whether the href of <a> tag starts with '#' (usually "#cm-")
			def _recursive_extract_li_tag(_li_tag):
				_li_tag_id = _li_tag.attrs["id"]
				_li_tag_id_split_list = _li_tag_id.split('-')
				assert _li_tag_id_split_list[0] == "cm", _li_tag_id
				assert _li_tag_id_split_list[1] == "wrap", _li_tag_id
				_a_tag = _li_tag.find('a', recursive=False)									# The <a> tag contains the name of the skill
				_div_tags = _li_tag.find_all("div", recursive=False)						# The first <div> tag contains the description of the skill
				_skill_id = '.'.join(_li_tag_id_split_list[2:])								# e.g. 4.C.1
				_skill_name = self.tag_regex.sub(str(), str(_a_tag)).strip()				# e.g. Interpersonal Relationships
				_skill_url = _a_tag.attrs["href"]											# e.g. #cm-4-C-1-c
				_skill_description = self.tag_regex.sub(str(), str(_div_tags[0])).strip()	# e.g. This category describes the context of the job in terms of human interaction processes.
				skill_results.append({"id": _skill_id, "name": _skill_name, "url": _skill_url, "description": _skill_description})
				# Judge if recursive extraction is required 
				_child_ul_tag = _li_tag.find("ul")
				_leaf_flag_1 = len(_div_tags) == 2
				_leaf_flag_2 = _skill_url.startswith('#')
				assert _leaf_flag_1 == _leaf_flag_2, f"{len(_div_tags)}, {_skill_url}"
				if _leaf_flag_1 and _leaf_flag_2:
					# Recursive extraction
					_top_div_tag = _div_tags[1]
					_ul_tags = _top_div_tag.find_all("ul", recursive=False)
					assert len(_ul_tags) == 1, f"Extract more than 1 top <ul> tags in skill {_skill_id}"
					_top_url_tag = _ul_tags[0]
					_li_tags = _top_url_tag.find_all("li", recursive=False)
					for _li_tag in _li_tags:
						_recursive_extract_li_tag(_li_tag)
			li_tags = top_ul_tag.find_all("li", recursive=False)
			for li_tag in li_tags:
				_recursive_extract_li_tag(_li_tag=li_tag)
		else:
			# Just find <a> tags of each skill entry
			a_tags = top_ul_tag.find_all('a')
			for a_tag in a_tags:
				href = a_tag.attrs["href"]
				if not href.startswith('#'):
					skill_id = href.split('/')[-1]								# e.g. 1.A.1.d.1
					skill_name = self.tag_regex.sub(str(), str(a_tag)).strip()	# e.g. Memorization 
					skill_url = self.home_url + href							# e.g. https://www.onetonline.org/find/descriptor/result/1.A.1.d.1
					skill_results.append({"id": skill_id, "name": skill_name, "url": skill_url})
		return skill_results

	# Extract title URLs on home page
	def extract_title_results(self, save_path=None):
		response = self.easy_requests(method="GET", url=self.home_url, headers=self.headers_dict)
		html = response.text
		soup = BeautifulSoup(html, "lxml")
		div_tag = soup.find("div", id="hsec-odata")
		a_tags = div_tag.find_all('a')
		title_results = list()
		for a_tag in a_tags:
			href = a_tag.attrs["href"]
			title_id = href.split('/')[-1]								# e.g. 1.A, 1.B.1
			title_name = self.tag_regex.sub(str(), str(a_tag)).strip()	# e.g. Abilities, Interests
			title_url = self.home_url + href							# e.g. https://www.onetonline.org/find/descriptor/browse/1.A
			title_results.append({"id": title_id, "name": title_name, "url": title_url})
		if save_path is not None:
			with open(save_path, 'w', encoding="utf8") as f:
				json.dump(title_results, f, indent=4)			
		return title_results
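
调用入口大致如下(依赖仓库内的BaseCrawler基类,仅为调用示意):

if __name__ == "__main__":
    crawler = ONetCrawler()
    # 先抓取首页的分类目录, 再逐类下载对应的Excel/CSV文件
    crawler.run_skill()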


20230830~20230831

  • 这届博士的住宿条件属实让人艳羡,七号楼(原国际生住)套间独卫,里面两个单间,一个更适合称为玄关的小客厅。王京和杨亦童前天搬好宿舍,今天带他俩熟悉了一下环境,请他俩到合生汇搓了一顿(主要是杨亦童,毕竟大老远从新疆过来,我得尽地主之谊),他俩虽然比我大两岁,但论资排辈也得叫我师兄。
  • ictai又跳票,到晚上十点都没来notification,群里大伙都急坏了,一路延期,总得给个准头吧。
  • 晚上操场重启,破军训总算结束了,加上这些天连日阴雨,属实糟心。晚上配素食大叔干了5k,胡鑫宇也在,他报了泰州半马,我下半年没有参赛的计划,最多试一下上马,但是上马没有半马项目,只有全马,主要看陈嘉伟,他要想跑上马,我就抽空陪他练,一起抽个签,抽中就去跑。

如何给证件照切换背景颜色(如蓝底->白底)

简单找到的在线工具都需要付费,可以考虑使用backgroundremover,这是一个命令行工具,可以使用pip安装:

import os
from PIL import Image

# 输入
in_path = "replace.jpg"
# 输出
out_path = "out.png"
# 要替换的背景颜色
color = "red"
# 红:red、蓝:blue、黑:black、白:white
# 去掉背景颜色(需先安装: pip install backgroundremover)
os.system('backgroundremover -i "' + str(in_path) + '" -o "cg_output.png"')

# 加上背景颜色(去底后的图带透明通道, 才能在下面paste时作为mask使用)
no_bg_image = Image.open("cg_output.png")
x, y = no_bg_image.size
new_image = Image.new('RGBA', no_bg_image.size, color=color)
new_image.paste(no_bg_image, (0, 0, x, y), no_bg_image)
new_image.save(out_path)

但是我就很费解这玩意儿居然要装一大堆依赖,而且需要torch和networkx,然后windows系统安装总是会出各种问题,最后找到了一个超简单好用且免费的线上工具https://www.remove.bg/zh


20230901

  • 嘉伟归来,训练立刻就有强度,晚上陪他干了票大的,他一共是12km,我跟了8km,中途轮流破风,手表配速4’08",实际要慢一些,但也是相当极限。
  • 杨亦童确是个猛人,今天跟他交流了一下,给我推荐了很多毕业神刊,拨云见日,茅塞顿开,感觉是组里的新大腿,而且感觉他的行事作风跟贺阳很像,很随意,但很大佬,他因为有回新疆老家的保底选择,所以读书特佛系。

关于soup.select

可以使用CSS选择器,其实CSS选择器和XPATH语法是接近的

html > h1 > a[class="h1"]这样得到一个标签(CSS的属性选择器不需要@,类名也可以直接写成a.h1)

XPATH里用的是/来区分层级,CSS则是用>

简单的一个XPATH实例:'//div[@class="page-box house-lst-page-box"]/a[last()-1]/text()'
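
下面给一个对照的小例子(HTML片段是假设的),分别用soup.select的CSS选择器和lxml的XPath取同一段内容:

from bs4 import BeautifulSoup
from lxml import etree

html = """<html><body>
<div class="page-box house-lst-page-box">
  <a href="/p1">1</a><a href="/p2">2</a><a href="/p3">3</a>
</div>
</body></html>"""

# CSS选择器: 层级用 > , 类名可以用 .class 连写
soup = BeautifulSoup(html, "lxml")
print([a.text for a in soup.select("div.page-box.house-lst-page-box > a")])  # ['1', '2', '3']

# XPath: 层级用 / , 属性条件写成 [@attr="value"]
tree = etree.HTML(html)
print(tree.xpath('//div[@class="page-box house-lst-page-box"]/a[last()-1]/text()'))  # ['2']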


20230902~20230904

  • 周六晚陪AK拼20k@400,AK带的很稳,一改之前的颓风,果然瘦死的骆驼比马大,四分配干满整整50圈,我前后一共只跟了6k,虚得不行,又困又乏,心肺完全支撑不了,嘉伟硬是顶了我双倍的量,才知道嘉伟两周前去西宁跑了趟半马,用时132,比之前上半马124慢些,但毕竟是高原,一般来说如果相差1k的海拔,配速差大概在15秒,但嘉伟作为甘肃人算有半个高原血统,说到底还是西南西北适合长跑,只是人太少了。
  • ictai约莫要凉,挤牙膏似的延期了两周,结果连个review都不给,接下来准备边写边投,说实话最近头大如牛,虽然实习事情不算太紧,但是书到用时方恨少,现在看起来还是cv比较吃香,nlp都被冲烂了。而且现在很多会议都要求不参会即撤稿,虽然周期短但成本也挺高的,现在这环境短时间显然签证办不成,所以大概率只能找当地的proxy,那也是不菲的价格。
  • 下午组里团购一波酱香拿铁,虽然是在搞噱头,但味道确实可以,商科出身的人对成本是真的敏感,到手就开始算一杯的成本,分析营销手段,战略研究岗的思维果然和算法数据岗差挺多,但也是很好的体验。晚上回来简单跑了10圈,碰到很多队员,然而自己却力不从心,想必这学期是没办法快乐训练了。

gradio@state

import gradio as gr

demo = gr.Blocks(css="""#btn {color: red} .abc {font-family: "Comic Sans MS", "Comic Sans", cursive !important}""")

with demo:
    default_json = {"a": "a"}

    num = gr.State(value=0)
    squared = gr.Number(value=0)
    btn = gr.Button("Next Square", elem_id="btn", elem_classes=["abc", "def"])

    stats = gr.State(value=default_json)
    table = gr.JSON()

    def increase(var, stats_history):
        var += 1
        stats_history[str(var)] = var**2
        return var, var**2, stats_history, stats_history

    btn.click(increase, [num, stats], [num, squared, stats, table])

if __name__ == "__main__":
    demo.launch()

20230905

  • 啃思棚留下的代码,他们金融出身的写代码实在不讲究,变量名瞎取,方法低内聚高耦合,难用难读,一天只重写了一半核心指标的算法。其实相对还是比较闲的(比杉数做远程闲,而且做宏观研究的更偏于讲故事,不是很硬,但是很有idea,比如昨天关于酱香咖啡的讨论,今天午饭关于教育与人口红利的讨论,他们总是会放眼十年二十年以后的市场,而我们总是在实现他们的idea),至少感觉连主管都不是很忙,而且一般都能准时下班。唯一的问题(也是关键的问题)是这边确实没啥好吃的,东西又贵又一般,能吃饱已是万幸,之前大三在源深路那边实习吃的店都找不到了,而且我除了外出从不点外卖,我在学校哪怕距离四五公里我都选择骑车到店去吃,不是很信任外卖的质量,最后还是宝玺带我逛了一圈地下,挑了份30块的盖浇饭(七八块胗片+两块猪耳边+午餐肉+半个咸鸭蛋),勉强吃饱。
  • 今晚状态其实还是ok的,下午AK在群里发了NIKE间歇课表,本来也是田径队新学期第一次训练,所以上地铁就先给AK发了消息约到学校跑课表,结果就忘了打卡,到学校吃个饭又坐到商城路,下车打卡,打完搁对面坐回程的地铁(原来都不用下车就能打到卡,早上又能拖几分钟再出门了),结果转18号线还坐反了(其实第一遍回来的时候坐9号线也坐反了,真是老了),回来已经九点多,只能慢摇了5k聊以自慰,这学期只能跟队周六的训练,真的要走下坡路了。

gradio@textbox:

import gradio as gr

def greet(name):
    return "Hello " + name + "!"

demo = gr.Interface(fn=greet, inputs="text", outputs="text")
    
if __name__ == "__main__":
    demo.launch()   

20230906

  • 昨晚9点半到操场,看到AK带嘉伟和宋镇均刚跑完10组变速(300米快+100米慢)在拉伸,有点可惜,这学期不能陪他们耍了,我只能5分配慢摇了10圈,今天回来也是10圈,不过是更慢的配速,不过今天主要是鞋子完全不适合跑,而且回来之后去蜀地源恰了四碗饭,本来就是本着消化慢摇的10圈,跑得又慢又累,虽然心率很低,但撑得难受。其实如果先回去洗个澡换好装备来也是能跑的,但是时间太紧,现在新生刚入学,每晚操场都很多人,大伙儿积极性确实很高,我还想着10月底的校运会能不能再参加一次,上次伤病没能跑到前三,今年要是能冲一冲就好了,机会渺茫唉,明天129大概也去不了。
  • 试了一下动宾识别的效果,GLM系列确实差GPT很多,甚至差3.5都很多,提示用简短的动宾短语进行识别,GLM只能很笨重的进行片段抽取,而且很不简洁,GPT是能对一些动宾倒置的结构进行有效捕获并调整为标准的动宾结构。此外也试图用stanford的parser进行识别,但效果很差,之前没有注意过,只是当成一个自动化工具用,现在发现这个句法分析的老parser的内置算法太平凡了,稍微改一改分词就识别的特别差,完全不能实用。正好今天听到旁边技术部做NLP的大佬开会讲的话,他们现在觉得堆数据做tune在这个开源涌现的新LLM时代是没啥意义的,我们还是应该多挖掘LLM的问题(比如幻觉),做一些有突破的工作。

gradio@timeseries

import random
import os
import gradio as gr


def fraud_detector(card_activity, categories, sensitivity):
    activity_range = random.randint(0, 100)
    drop_columns = [
        column for column in ["retail", "food", "other"] if column not in categories
    ]
    if len(drop_columns):
        card_activity.drop(columns=drop_columns, inplace=True)
    return (
        card_activity,
        card_activity,
        {"fraud": activity_range / 100.0, "not fraud": 1 - activity_range / 100.0},
    )
demo = gr.Interface(
    fraud_detector,
    [
        gr.Timeseries(x="time", y=["retail", "food", "other"]),
        gr.CheckboxGroup(
            ["retail", "food", "other"], value=["retail", "food", "other"]
        ),
        gr.Slider(1, 3),
    ],
    [
        "dataframe",
        gr.Timeseries(x="time", y=["retail", "food", "other"]),
        gr.Label(label="Fraud Level"),
    ],
    examples=[
        [os.path.join(os.path.dirname(__file__), "fraud.csv"), ["retail", "food", "other"], 1.0],
    ],
)
if __name__ == "__main__":
    demo.launch()

20230907

  • 日渐疲累,今天总算改完思棚的所有代码,三天写了完整带注释的代码将近2000行,每天中午都不休息,从早肝到晚,虽然通勤时间不长也不加班,早上八点四十才出门,晚上七点十分就能到学校,但是强度太大,身体很难受,感觉像是空调吹太久(主要是地铁空调吹得难受),而且久坐很僵硬。
  • 然后晚上回来就特别纠结,去哪儿吃饭,吃完去哪儿,主要是今晚队里训练,同济129也有训练,我想出去吃,但是吃太饱又跑不动,到学校吃的话想练又得回去换衣服洗澡再折返,但是又很累,都想直接回去吃个泡面睡觉了,最后想还是去学校吃,然后直接去操场混一下,毕竟新学期还没去队里露脸。
  • 先自己慢摇8圈权当热身(其实8圈就已经很累了),衣服鞋子也没换,本来想就这样走了,但还是跟着做了两组核心(四肢撑×1->三肢撑×4->两肢撑×4,每个动作30秒),一下子觉得感觉好多了,最后又跟宋镇均跑了一组3000米,12’44",均配4’14",出了一身汗终于精神好多了,说到底还是得练才行。而且卢星雨还一直坚持练,说实话我觉得女生其实长时间不见总是会变化特别大,但是卢星雨这么多年真的没咋变过,咋样还是咋样。
  • 今晚嘉伟在同济大杀四方,高级组1.2k@3’45"×6(pacer日常不诚信,一般带到3’40"),间歇4分钟,上次这个课表我跟了三组就报销了
  • 与宋镇均聊了很久,他暑假和前女友分手,今年大四面临毕业,GPA不是很高所以想直接本科毕业工作,不过他是会院又是上海本地人,要找份体面的工作并不困难,我建议能尽快工作成家立业是最好,拖着消耗的是将来的成本。

ChromeDriver下载官网地址(需翻墙)

http://npm.taobao.org/mirrors/chromedriver/和https://chromedriver.storage.googleapis.com/index.html的版本目前只更新到114,但是最新的Chrome版本已经到116.0.5845.180
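
下载解压后,selenium 4里可以通过Service显式指定驱动路径(下面的路径只是示意,请替换为实际解压出来的chromedriver可执行文件):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service(executable_path=r"D:\tools\chromedriver.exe")  # 示意路径
driver = webdriver.Chrome(service=service)
driver.get("https://www.onetonline.org/")
print(driver.title)
driver.quit()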

gradio@uploadbutton

import gradio as gr
def upload_file(files):
    file_paths = [file.name for file in files]
    return file_paths

with gr.Blocks() as demo:
    file_output = gr.File()
    upload_button = gr.UploadButton("Click to Upload a File", file_types=["image", "video"], file_count="multiple")
    upload_button.upload(upload_file, upload_button, file_output)
demo.launch()

20230908~20230909

  • 昨天回来时其实感觉还行,甚至想去陪嘉伟冲两个1k,但是时间太晚,而且还没吃晚饭。每天都很纠结晚饭吃什么,想跑步就得多吃,但多吃消化不掉又跑不动,先跑后吃又太饿,先吃后跑又太撑,不跑就只能随便吃点儿,但随便吃点儿又不够本,这就是晚饭怪圈。最后做了最咸鱼的选择,跑休,然后八点半去蜀地源干4碗饭,吃爽就完事了,练个鸟。
  • 今晚跟嘉伟上了强度,(1k@3’40"+200m走路)×6,前三组按3’35",3’40",3’40"跑,然后休息了5分钟,第四组我跳过(太久没上强度,前三组我就已经上头了,所以第四组嘉伟一个人跑,结果跑了3’20",我人都傻了,果然嘉伟是照顾我前面没有带太快),后面两组3’43",最后2圈冷身+拉伸。嘉伟现在感觉已经更强了,下周四129要测5k,他期望能进18分钟,我是亲眼看着嘉伟从20分钟一路跑到18分钟的,2021年的高百嘉伟10k只跑了44’06",那时候我们还是一个水平线,现在我已经只能当他的背景板了,真的老了。

gradio@video

import gradio as gr
import os


def video_identity(video):
    return video


demo = gr.Interface(video_identity, 
                    gr.Video(), 
                    "playable_video", 
                    examples=[
                        os.path.join(os.path.dirname(__file__), 
                                     "video/video_sample.mp4")], 
                    cache_examples=True)

if __name__ == "__main__":
    demo.launch()

20230910~20230911

  • 昨天搬家,先把一些衣服和重的用品搬到2号楼,下周末再全搬过去。之前以为7人间会很挤,但其实有两个卫生间,3个双间1个单间(AB上床下桌,C单间,D两张单人床+桌+柜),而且客厅贼大,感觉能放下两个乒乓球台,唯一的缺陷就是A间柜子实在是太少了,感觉可能不够放。这次运气也贼好,室友是个上海本地工作的在职哲学博士,年龄很大,很少来住,所以又是一人住双间,而且跟胡鑫宇一个套间,一起约跑也很方便。
  • 昨晚嘉伟自测了5000米,18’12"(震惊,为什么嘉伟进步这么快,我就始终停滞不前唉),均配3’38",后程配速掉的太多,之前队里派两个兔子带他测的时候跑出的PB是18’16",这次在拥挤的操场自测能跑出新的PB(4月校运会是18’33"),果然更强了,大概率比赛能跑进18分钟。然后我也被刺激了一下,搬完家换衣服也想去测一下5000米,结果一激动起猛,7分半跑了2000米就崩了,菜得真实(srds,这个2000米也算是个人历史前三的水平,我觉得自己还是有丶进步的,只是跟嘉伟比相形见绌)。
  • 这周三组里要outing三天,预计会很爽,大约只要干两天活,后三天可以摸鱼,准备约颜烨和王凯出来,有半年没有聚了。

挂个IDEALab的selenium爬虫(这次我发现一个requests很难处理的抓包,就是如果POST请求的响应是EventStream,即流式返回,那么实际上response会得到什么呢?我不知道,因为没有测试成功过,请求头是阅后即焚,所以测试的时候返回都是error,我不知道如果能正确返回应该长啥样)。我还以为蚂蚁财大气粗内网能无限用大模型接口的,结果每天就两百条,各个模型还是共用的数量限制,本来上周五走之前把爬虫挂这,结果周一回来看爬了9000多条,只有600多条是可用的,真是醉了。不过内网能翻墙已经很好了(但会被检测,又不敢访问敏感网站):

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu

import os
import time
import requests
import pandas as pd

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

from src.crawler.base import BaseCrawler

class IdeaLabCrawler(BaseCrawler):
	home_url = "https://idealab.alibaba-inc.com/chat"
	model_cards_xpath = "//div[@class=\"next-slick-list\"]"										# XPath of all model cards on home page
	the_first_model_card_xpath = "//div[@class=\"next-slick-list\"]//div[@data-index=\"0\"]"	# XPath of the first model card (ChatGPT 3.5) on home page
	the_fourth_model_card_xpath = "//div[@class=\"next-slick-list\"]//div[@data-index=\"3\"]"	# XPath of the fourth model card (ChatGLM2 6B) on home page
	
	input_box_xpath = "//textarea[@id=\"inputValue\"]"											# XPath of the input box for chatting
	send_button_xpath = "//div[@class=\"send-button\"]"											# XPath of the button to send text of input box
	chat_area_xpath = "//div[@class=\"idea-talk-prompted\"]"									# XPath of the chat area which contains all the talks
	human_box_xpath = "//div[@class=\"prompt-box\"]"											# XPath of human chat box
	ai_box_xpath = "//div[@class=\"ChatBox\"]//div[@class=\"ChatBox-contents-inner-box-text\"]"	# XPath of AI chat box
	create_new_xpath = "//div[@class=\"idea-talk-record-list-add\"]"							# XPath of create new talk
	def __init__(self):
		super(IdeaLabCrawler, self).__init__()

	def request(self, text, is_initial=False):
		if is_initial:
			# Home page
			self.driver.get(self.home_url)
			WebDriverWait(self.driver, self.timeout).until(lambda _driver: _driver.find_element_by_xpath(self.model_cards_xpath).is_displayed())
			# Click on the first model (ChatGPT 3.5)
			# self.driver.find_element_by_xpath(self.the_first_model_card_xpath).click()
			self.driver.find_element_by_xpath(self.the_fourth_model_card_xpath).click()
		# Input text
		self.driver.find_element_by_xpath(self.input_box_xpath).send_keys(text)
		# Send text
		self.driver.find_element_by_xpath(self.send_button_xpath).click()
		# Get AI response
		WebDriverWait(self.driver, self.timeout).until(lambda _driver: _driver.find_element_by_xpath(self.chat_area_xpath).is_displayed())
		WebDriverWait(self.driver, self.timeout).until(lambda _driver: _driver.find_element_by_xpath(self.human_box_xpath).is_displayed())
		WebDriverWait(self.driver, self.timeout).until(lambda _driver: _driver.find_element_by_xpath(self.ai_box_xpath).is_displayed())
		aigc_element = self.driver.find_element_by_xpath(self.ai_box_xpath)
		last_aigc_html = aigc_element.get_attribute("innerHTML")
		time.sleep(self.interval)
		while True:
			aigc_element = self.driver.find_element_by_xpath(self.ai_box_xpath)
			aigc_html = aigc_element.get_attribute("innerHTML")
			if aigc_html == last_aigc_html:
				# There is no change on the page
				break
			last_aigc_html = aigc_html
			time.sleep(self.interval)
		response = self.tag_regex.sub(str(), last_aigc_html).strip("\n\t ")
		self.driver.quit()
		return response
		
	# @param data_path: 
	def demo(self, data_path=r"D:\交接文档\岗位信息示例.xlsx", save_path="chatgpt-extract-full.txt"):
		# with open(save_path, 'w', encoding="utf8") as f:
			# pass
		# data = pd.read_excel(r"D:\交接文档\岗位信息实例-硬技能-v1.xlsx")
		data = pd.read_excel(data_path)
		texts = data["intro"]
		prompt = """请使用简洁的动宾短语总结下面招聘广告中提到的职业技能,我可以给你一个参考案例:
-----------------------------------------
岗位职责: 
1、熟练使用photoshop或lightroom,对图片进行调色、合成、人像美化修调; 2、根据顾客要求进行照片后期处理; 3、完成每日分配的工作量。 任职资格: 1、一年以上相关工作经验; 2、色彩控制能力强、富有创意及执行力; 3、吃苦耐劳,具有创新精神及团队协作精神; 4、责任心强,有良好的沟通能力。  基本薪资+提成8-11k,具体内容面试详谈。

该招聘广告中提到的技能有:使用photoshop或lightroom,调色图片,合成图片,美化修调人像,后期处理照片。

2、机械工程师1名岗位职责:①机械设备维修专业或者具有维修机械设备的经验。②按计划完成现场设备维护计划及维护内容;③负责设备完好率、设备停机率、设备故障分析等统计分析工作。④组织协调维修机械人员进行设备故障分析,有计划地制定维修方案并进行跟踪与落实。

该招聘广告中提到的技能有:维修机械设备,完成维护计划,统计设备数据,协调维修工作,制定维修方案

3、岗位职责:1)负责与客户的沟通,全面清楚的了解客户的需求; 2)领导、组织设计团队完成设计任务,统筹规划设计方案的制定,并能够向客户做提案演示; 3)参与公司管理工作,对设计部门的工作做统筹安排,负责检查设计部门的工作进度和工作质量; 4)协助项目经理,解决现场施工过程中发生的设计方面的问题: 5)协助市场开拓经理,提供市场信息和资源。任职要求:1)室内设计相关教育背景,大专及以上学历,5年以上设计经验; 2)熟悉建筑,设计,房地产等行业,有类似公司背景者优先;3)熟练运用各种绘图软件,如AUTO CAD,3D MAX,CORELDRAW,PHOTOSHOP等软件;有过独立设计样板房,大型别墅、办公餐饮空间,装饰工程设计与施工配合经验; 4)对设计潮流有领悟力和不断追求完美设计的毅力; 5)能带领独立的设计小组进行高效率的工作,具备优秀的职业素质; 6)擅长创意方案和施工图创作,手稿功底扎实,尤其方案能力及施工图能力强。

该招聘广告中提到的技能有:沟通客户需求、领导设计团队、制定设计方案、做提案演示、检查工作进度和质量、解决施工中的设计问题、提供市场信息和资源、有室内设计背景、熟练使用绘图软件、有设计经验、对设计潮流有领悟力、能带领设计小组、擅长创意方案和施工图创作。

-----------------------------------------

那么请按照我给你提供的案例,来提取下面几个的招聘广告中的职业技能(以简洁的动宾短语形式,越简洁越好!):"""
		for i, text in enumerate(texts):
			print(i, text)
			text = prompt.replace('\n', "    ") + text.replace('\n', "    ")
			while True:
				try:
					self.driver = self.initialize_chrome_driver()
					response = self.request(text, is_initial=True)
					break
				except Exception as e:
					print(e)
					self.driver.quit()
					time.sleep(self.interval)
					continue
			response = response.replace('\n', '|')
			with open(save_path, 'a', encoding="utf8") as f:
				f.write(f"{i}\t{response}\n")
			is_initial = False
			print('#' * 64)
			time.sleep(2)
		self.driver.quit()
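
补充一下前面提到的流式(EventStream)响应:requests其实可以用stream=True逐行读取,下面是一个通用的示意写法(URL和请求体都是假设的,并非IdeaLab的真实接口,能否跑通还取决于鉴权):

import json
import requests

url = "https://example.com/api/chat"  # 假设的流式接口, 仅作示意
payload = {"prompt": "你好"}

with requests.post(url, json=payload, stream=True, timeout=60) as response:
    for line in response.iter_lines(decode_unicode=True):
        if not line:
            continue
        # SSE的每条消息通常形如 "data: {...}"
        if line.startswith("data:"):
            chunk = line[len("data:"):].strip()
            if chunk == "[DONE]":
                break
            print(json.loads(chunk))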

20230912~20230913

  • 逐渐适应节奏,精神状态慢慢良性发展,周一晚回去简餐后,慢跑外道10圈(25分钟),不是很累,昨天又跟素食大叔干了5000米(间歇2k+1k+1k+1k,间歇时长为2分钟走200米,均配3’47",看起来跟素食大叔打了个55开,但其实他是先跑了5000米才跟我一起练,而这个强度对我来说很极限了),今晚准备跑休,好好吃一顿(每天只跑4-5k消耗并不大,但就是想吃撑,反正也吃不胖)。
  • 上周六AK先在世纪公园跑了20km(4’05"),周日又去杭州跑越野20km,每天早八晚八能这么猛是真的可怕,相较之下我还是太弱鸡。
  • 本周四晚七点129训练营将于同济大学一二九运动场进行5000米场地 测试赛,pacer配速:精英组3’30"/km,高级组3’45"/km,中级组4’00"/km,初级组4’30"/km,入门组5’20"/km。嘉伟一定会去(能否破18分钟?),我也很想去测(毕竟还想最后参加一次校运会),但是大约赶不上,接下来秋雨季,日子不好过。
  • 凯爹还是住在老地方,所以昨晚跟颜烨王凯在陆家嘴中心小聚了一下,大家都很忙,在coco吃了点儿咖喱,凯爹还要回去录纪要,颜烨还得回8楼干活到10点(不过他们现在不考勤,早上十点多才到,晚上也干到十点多走),了解了一下现在互联网和金融的行情,看来金融日子也不是很好过,而且较于互联网,感觉金融那边风气还要差些。说到底如果不买房不成家,确实可以在上海过得很圆润。

正则记录:

^ 匹配字符串的开头。
$ 匹配字符串的末尾。
. 匹配除了换行符(\n)的任意字符;当指定re.DOTALL标记时,则可以匹配包括换行符在内的任意字符。
[...] 用来表示一组字符,单独列出:[amk] 匹配 'a'、'm' 或 'k'。
[^...] 不在[]中的字符:[^abc] 匹配除了a、b、c之外的字符。
re* 匹配0个或多个前面的表达式。
re+ 匹配1个或多个前面的表达式。
re? 匹配0个或1个由前面的正则表达式定义的片段,非贪婪方式。
re{n} 精确匹配n个前面的表达式。
re{n,} 匹配n个及以上前面的表达式。
re{n,m} 匹配n到m次由前面的正则表达式定义的片段,贪婪方式。
a|b 匹配a或b。
(re) 匹配括号内的表达式,也表示一个分组。
(?imx) 正则表达式包含三种可选标志:i、m 或 x,只影响括号中的区域。
(?-imx) 正则表达式关闭 i、m 或 x 可选标志,只影响括号中的区域。
(?:re) 类似 (re),但是不表示一个分组。
(?imx:re) 在括号中使用 i、m 或 x 可选标志。
(?-imx:re) 在括号中不使用 i、m 或 x 可选标志。
(?#...) 注释。
(?=re) 前向肯定断言:如果所含正则表达式在当前位置能够匹配则成功,否则失败;断言不消耗字符,匹配引擎不会前进,模式的剩余部分从断言处继续尝试。
(?!re) 前向否定断言:与肯定断言相反,当所含表达式不能在字符串当前位置匹配时成功。
(?>re) 匹配的独立模式,省去回溯。
\w 匹配字母、数字及下划线,等价于 [A-Za-z0-9_]。
\W 匹配非字母、数字及下划线,等价于 [^A-Za-z0-9_]。
\s 匹配任意空白字符,等价于 [ \t\n\r\f\v]。
\S 匹配任意非空白字符,等价于 [^ \t\n\r\f\v]。
\d 匹配任意数字,等价于 [0-9]。
\D 匹配任意非数字,等价于 [^0-9]。
\A 匹配字符串开始。
\Z 匹配字符串结束;如果存在换行,只匹配到换行前的结束字符串。
\z 匹配字符串结束。
\G 匹配上一次匹配完成的位置。
\b 匹配一个单词边界,即单词和空格间的位置。例如 'er\b' 可以匹配 "never" 中的 'er',但不能匹配 "verb" 中的 'er'。
\B 匹配非单词边界。'er\B' 能匹配 "verb" 中的 'er',但不能匹配 "never" 中的 'er'。
\n、\t 等 分别匹配换行符、制表符等转义字符。
\1...\9 匹配第n个分组的内容。
\10 匹配第n个分组的内容(如果该分组已经匹配过),否则表示八进制字符码。
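
配合上面的速查表,随手验证几条:

import re

text = "联系电话: 021-12345678, never a verb"
# \d{3,4}-\d{7,8}: 匹配固定电话号码
print(re.findall(r"\d{3,4}-\d{7,8}", text))         # ['021-12345678']
# er\b 只匹配单词边界处的er
print(re.findall(r"er\b", text))                    # ['er'], 来自never
# (?=...) 前向肯定断言, 不消耗字符
print(re.search(r"\w+(?= a verb)", text).group())   # never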


20230914

  • 无聊的雨季,这两天组里outing去,跟宝玺姐相依为命,昨天跟步执的汇报有点失策,主要一直在赶代码,talk is cheap, show me your code,但这里并不是这样,idea is more expensive。把要汇报的东西弄完,可以摸鱼了。
  • 晚上同济冒雨进行5000米测试,回来吃完饭已经七点半,实在是赶不上,嘉伟本场再次PB,17’46",平均配速3’32",几乎是出人意料又情理之中的打开了18分钟大关,我已经不想再惊讶了,从他第一次校运会19’19",到队内兔子破风跑出的18’16",再到同济测试17’46",只能说人和人之间的天赋真的差距很大。以前我认为AK的水平是不可及的(全国大学生田径锦标赛5k17’10",10k35’08"),现在看来嘉伟极有可能毕业前超越AK的战绩。而咸鱼的上财连操场都不开,我只能去实验大楼跑了5个上下。

挂一下规则模板抽取动宾的代码,结果论上来说pkuseg确实比jieba好,但是太慢:

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu

import re
import jieba
import pkuseg

from jieba import posseg

from src.nlp.base import BaseNLP
from src.util import timer

class JobAdNLP(BaseNLP):
	uncritical_pos_tags = {"pkuseg": ['r', 'z', 'y', 'e', 'o', 'i', 'x', 'd']}
	noun_pos_tags = {"pkuseg": ['n', "nr", "ns", "nt", "nx", "nz", "vn", "an"]}
	verb_pos_tags = {"pkuseg": ['v', 'vx']}
	other_pos_tags = {"pkuseg": ['c', 'u', 'w']}	# remain 'w' means to remain "、"
	def __init__(self, userdict_path, stopword_path):
		super(JobAdNLP, self).__init__(userdict_path, stopword_path)

	def extract_phrase(self, document, save_path, library="pkuseg"):
		# Global variables
		with open(save_path, 'w', encoding="utf8") as f:
			pass
		with open(f"detailed-{save_path}", 'w', encoding="utf8") as f:
			pass
		with open(f"simplified-{save_path}", 'w', encoding="utf8") as f:
			pass
		with open(f"brief-{save_path}", 'w', encoding="utf8") as f:
			pass
		regexs_for_substitude = [("[【(\(][^))】]+[\))】]", str()),					# e.g. 【岗位职责】
								 ("(初中|高中|中?专|大专)(以?及)?(以[上内下])?(学历)?", str()),	# e.g. 
								 ("\d{1,2}周?岁以[内下上]", str()),						# e.g. 35岁
								 ("\d{1,2}周?岁", str()),								# e.g. 35岁
								 ("\d{1,2}年及?以[上下内]", str()),						# e.g. 10年以上
								 ("[一二两三四五六七八九十]年及?以[上下内]", str()),			# e.g. 两年及以上
								  ("[【】✅☞??]", str()),									# e.g. 特殊字符删除
								 ("\d+[、.\)))]", '。'),								# e.g. 1.
								 ("[\d一二三四五六七八九十][、.]", '。'),					# e.g. 一、
								 ("\d{1,2}[::]\d{2}", str()),							# e.g. 8:30
								 ("(\d{1,5}或)\d{1,5}元?/[周天月年]", str()),				# e.g. 40或50每天
								 ("([^\d])\d([^\d])", r"\1。\2"),						# e.g. 
							     ]
		regexs_for_substitude = list(map(lambda _tuple: (re.compile(_tuple[0]), _tuple[1]), regexs_for_substitude))
		regex_for_split = re.compile("[。;!?]+|\-+")
		forbidden_strings = ["00", "周岁", "岗位职责", "任职要求", "职位要求",
							 "工作职责", "岗位要求", "任职资格",
							 ]	# Note that there is no 
		forbidden_strings = list(filter(None, forbidden_strings))
		uncritical_tokens = ["福利", "晋升", "薪资", "工资", "工作地址",
							 "联系人", "联系电话", "待遇", "五险", "底薪",
							 "一金", "五险一金", "奖金", "上班", "工作时间",
							 "职位发展", "培训与发展", "绩效", "技能", "免费",
							 "职前培训", "级别", "年龄", "截止"
							 ]
		uncritical_tokens = list(filter(None, uncritical_tokens))
		tokenizer = self.load_tokenizer(library=library, pos_tag=True)
		# Preprocess ad text
		def _preprocess(_text):
			for _regex, _sub_string in regexs_for_substitude:
				_text = _regex.sub(_sub_string, _text)
			_sentences = regex_for_split.split(_text)
			_sentences = list(map(lambda _sentence: _sentence.strip(), _sentences))
			_sentences = list(filter(None, _sentences))
			return _sentences

		# Judge if a phrase is invalid
		def _is_invalid(phrase):
			_flags = [len(phrase) <= 1,
					  phrase.isdigit()]
			for _string in forbidden_strings:
				_flags.append(_string in phrase)
			return sum(_flags) > 0

		# Extract phrase for each sentence in ad
		# @timer
		def _extract_phrases(_sentence):
			_sentence_phrases = list()
			_detailed_sentence_phrases = list()
			_sentence_phrase = str()
			_detailed_sentence_phrase = list()
			_last_phrase_flag = True	# Flag for last phrase
			_token_pos_tuples = tokenizer(_sentence)
			for (_token, _pos) in _token_pos_tuples:
				if (not _token == "、") and (_pos.startswith('x') or _pos.startswith('w')):
					# Encounter a punctuation
					for _uncritical_token in uncritical_tokens:
						if _uncritical_token in _sentence_phrase:
							_last_phrase_flag = False
							break
					if not _last_phrase_flag:
						break
					if (not _is_invalid(_sentence_phrase)):
						_sentence_phrases.append(_sentence_phrase)
						_detailed_sentence_phrases.append(_detailed_sentence_phrase)
					_sentence_phrase = str()
					_detailed_sentence_phrase = list()
				else:
					# Non-punctuation token: accumulate it into the current phrase
					_sentence_phrase += _token
					for _uncritical_token in uncritical_tokens:
						if _uncritical_token in _sentence_phrase:
							_last_phrase_flag = False
							break
					if not _last_phrase_flag:
						break
					_detailed_sentence_phrase.append((_token, _pos))
			# Process last phrase
			if (not _is_invalid(_sentence_phrase)) and _last_phrase_flag:
				_sentence_phrases.append(_sentence_phrase)
				_detailed_sentence_phrases.append(_detailed_sentence_phrase)
			# @return _sentence_phrases: ['按照设计的功能', '实现模块的代码编写、自测和维护']
			# @return _detailed_sentence_phrases: [[('按照', 'p'), ('设计', 'v'), ('的', 'u'), ('功能', 'n')], [('实现', 'v'), ('模块', 'n'), ('的', 'u'), ('代码', 'n'), ('编写', 'v'), ('、', 'w'), ('自测', 'v'), ('和', 'c'), ('维护', 'v')]]
			return _sentence_phrases, _detailed_sentence_phrases, _last_phrase_flag

		# Simplify extracted phrase
		# @_detailed_sentence_phrase: [('积极', 'ad'), ('接受', 'v'), ('开发', 'vn'), ('负责人', 'n'), ('分配', 'v'), ('的', 'u'), ('设计', 'vn'), ('和', 'c'), ('开发', 'vn'), ('任务', 'n')]
		def _simplify_phrases(_detailed_sentence_phrases):
			_simplified_sentence_phrases = list()
			for _detailed_sentence_phrase in _detailed_sentence_phrases:
				# Step 1: Find verb blocks and noun blocks in `_detailed_sentence_phrase`
				# Rule 1: Block algorithms
				# Rule 1.1: we classify a phrase into several blocks: [noun, (others), verb, (others), noun, ...] (There is no preposition block in it, the detail can be seen below)
				# Rule 1.2: deal with (others) blocks
				# Rule 1.2.1: only adverb(ad) is classified into verb block, other (others) classified into noun block
				# Rule 1.2.2: if a (others) block connects two same blocks, then just merge them
				# Rule 1.3: If a phrase contains none of noun or verb blocks, then delete it
				# Rule 1.4: If a phrase contains only noun blocks or verb blocks, then connect it to the former or latter phrase
				# Rule 1.4.1: If a phrase contains only noun blocks, then connect it to the former phrase with a punctuation (No former one then delete it)
				# Rule 1.4.2: If a phrase contains only verb blocks, then connect it to the latter phrase with a punctuation (No latter one then connect it to the former, neither then delete it)
				# Rule 1.5: If encounter a preposition, then skip the first noun block after it
				# e.g. 'n', 'v', 'o', 'p', 'pn', namely refer to noun block, verb block, (others) block, preposition(not encounter noun), preposition(has encountered noun)
				_current_block_flag = None 		
				_current_block_string = str()
				_simplified_sentence_phrase = list()
				for (_token, _pos) in _detailed_sentence_phrase:
					# Uncritical: ignore
					if _pos in self.uncritical_pos_tags[library]:
						continue
					# Preposition: Rule 1.5
					elif _pos == 'p':
						if len(_current_block_string) > 1 and _current_block_flag in ['n', 'v', 'o']:
							_simplified_sentence_phrase.append((_current_block_string, _current_block_flag))							
						_current_block_flag = 'p'
						_current_block_string = str()

					# Noun: build noun blocks
					elif _pos in self.noun_pos_tags[library]:
						if _current_block_flag in ['p', 'pn']:
							_current_block_flag = 'pn'
						elif _current_block_flag is None or _current_block_flag == 'n':
							_current_block_string += _token
							_current_block_flag = 'n'
						else:
							# i.e. _current_block_flag in ['v', 'o']
							if (len(_current_block_string) > 1 and _current_block_flag == 'v') or (len(_current_block_string) > 0 and _current_block_flag == 'o'):
								_simplified_sentence_phrase.append((_current_block_string, _current_block_flag))
							_current_block_flag = 'n'
							_current_block_string = _token
							
					# Verb: build verb blocks
					elif _pos in self.verb_pos_tags[library]:
						if _current_block_flag == 'p':
							continue
						elif _current_block_flag == 'pn':
							_current_block_flag = 'v'
							_current_block_string = _token
						elif _current_block_flag is None or _current_block_flag == 'v':
							_current_block_string += _token
							_current_block_flag = 'v'
						else:
							# i.e. _current_block_flag in ['n', 'o']
							if (len(_current_block_string) > 1 and _current_block_flag == 'n') or (len(_current_block_string) > 0 and _current_block_flag == 'o'):
								_simplified_sentence_phrase.append((_current_block_string, _current_block_flag))
							_current_block_flag = 'v'
							_current_block_string = _token
					# Others: remain
					else:
						# Rule 2: Only remains adverb(ad) contains 熟, e.g. 熟悉, 熟练
						if _pos == 'ad' and (not "熟" in _token):
							continue
						# Rule 3: Skip 'c' contains 并, e.g. 并完成负责人分配的任务
						if _pos == 'c' and ("并" in _token):
							continue
						if _current_block_flag in ['p', 'pn']:
							continue
						elif _current_block_flag is None or _current_block_flag == 'o':
							_current_block_string += _token
							_current_block_flag = 'o'
						else:
							# i.e. _current_block_flag in ['v', 'n']
							if len(_current_block_string) > 1 and _current_block_flag in ['v', 'n']:
								_simplified_sentence_phrase.append((_current_block_string, _current_block_flag))
							_current_block_flag = 'o'
							_current_block_string = _token
				if len(_current_block_string) > 1 or (len(_current_block_string) == 1 and _current_block_flag == 'o'):
					# Ibid. Tolerate those `_current_block_string`s consist of only one char, e.g. "高(质量)"
					_simplified_sentence_phrase.append((_current_block_string, _current_block_flag))
				_simplified_sentence_phrases.append(_simplified_sentence_phrase)
			
			_brief_sentence_phrases = list()
			# Rule 4: Generate `_brief_sentence_phrases`
			# Rule 4.1: If there is both v and n blocks in _simplified_sentence_phrase, then just concatenate it
			# Rule 4.2: If there is only v blocks in _simplified_sentence_phrase, then combine it with the latter _simplified_sentence_phrase (no latter then former, no latter and no former then delete it)
			# Rule 4.3: If there is only n blocks in _simplified_sentence_phrase, then combine it with the former _simplified_sentence_phrase (no former then latter, no latter and no former then delete it)
			for _simplified_sentence_phrase in _simplified_sentence_phrases:
				_brief_sentence_phrase = str()
				_v_flag = False
				_n_flag = False
				_previous_string = str()
				for (_block_string, _block_flag) in _simplified_sentence_phrase:
					if _block_flag == 'v':
						_v_flag = True
					elif _block_flag == 'n':
						_n_flag = True
					_brief_sentence_phrase += _block_string
				if _v_flag and _n_flag:
					# Rule 4.1
					_brief_sentence_phrases.append(f"{_previous_string}{_brief_sentence_phrase}")
					_previous_string = str()
				elif _v_flag and (not _n_flag):
					# Rule 4.2, common case
					_previous_string += f",{_brief_sentence_phrase}"
				elif _n_flag and (not _v_flag):
					# Rule 4.3, common case
					if _brief_sentence_phrases:
						_brief_sentence_phrases[-1] += f"{_previous_string}{_brief_sentence_phrase}"
					else:
						_previous_string += f",{_brief_sentence_phrase}"
				else:
					# Rule 4.2 & 4.3, default case
					continue
				if _previous_string and _brief_sentence_phrases:
					_brief_sentence_phrases[-1] += _previous_string
			# Postprocess `_brief_sentence_phrases`
			_forbidden_strings = ["身高", "体重", "优先", "招聘", "入职", "科技园"]
			def __final_process(__phrase):
				for ___forbidden_string in _forbidden_strings:
					if ___forbidden_string in __phrase:
						return None
				__puncts = ",,.。;;[]【】-——、"
				__phrase = __phrase.lstrip(__puncts).rstrip(__puncts)
				return __phrase
			_brief_sentence_phrases = list(filter(None, map(__final_process, _brief_sentence_phrases)))
			return _simplified_sentence_phrases, _brief_sentence_phrases
		
		for i, ad_text in enumerate(document):
			print(i)
			ad_sentences = _preprocess(_text=ad_text)
			ad_phrases = list()
			detailed_ad_phrases = list()
			simplified_ad_phrases = list()
			brief_ad_phrases = list()
			for ad_sentence in ad_sentences:
				ad_sentence_phrases, detailed_sentence_phrases, last_phrase_flag = _extract_phrases(_sentence=ad_sentence)			
				simplified_sentence_phrases, brief_sentence_phrases = _simplify_phrases(_detailed_sentence_phrases=detailed_sentence_phrases)
				# Append
				ad_phrases.append(ad_sentence_phrases)
				detailed_ad_phrases.append(detailed_sentence_phrases)
				simplified_ad_phrases.append(simplified_sentence_phrases)
				brief_ad_phrases.extend(brief_sentence_phrases)
				if not last_phrase_flag:
					break
			with open(save_path, 'a', encoding="utf8") as f:
				f.write(f"{list(filter(None, ad_phrases))}\n")
			with open(f"detailed-{save_path}", 'a', encoding="utf8") as f:
				f.write(f"{list(filter(None, detailed_ad_phrases))}\n")
			with open(f"simplified-{save_path}", 'a', encoding="utf8") as f:
				f.write(f"{list(filter(None, simplified_ad_phrases))}\n")
			with open(f"brief-{save_path}", 'a', encoding="utf8") as f:
				f.write(f"{list(filter(None, brief_ad_phrases))}\n")

20230915~20230916

  • 周末日常补觉,但是还要搬家,搬到下午四点半回来睡到六点,起来AK问我跑不跑,那当然跑!结果又是惨遭拉爆的一晚。第一段AK6k,我4k,配速3’52",过3k的时候AK问我要不要休息一下,我觉得能干到5k,所以他第4个1k就提速了,我原地爆炸,本来今晚感觉真的很有机会5000米PB,顶完5k应该能到19分半左右,可惜4k结束实在力不能逮,难受得要死。
  • 休息不到十分钟后,第二段跟了3k就爆了,配速3’58",心率飙到183,心肺爆炸,AK最后一共跑了15k,我只跑了9k,他一周加班,五天没跑,正好周三到周五也一直下雨,今天想跑个大的,其实如果是4’10"左右的配速,我应该能顶到10km以上。不过现在确实感觉水平有提升,4分配以内的速度并不很吃力,体感跟一年前4’10"的配速差不多,今天主要还是状态没有那么好,缺少一个爆发的契机。中长跑就是这样,不断被拉爆,才能有所提升。

gradio@error

import gradio as gr

def calculator(num1, operation, num2):
    if operation == "add":
        return num1 + num2
    elif operation == "subtract":
        return num1 - num2
    elif operation == "multiply":
        return num1 * num2
    elif operation == "divide":
        if num2 == 0:
            raise gr.Error("Cannot divide by zero!")
        return num1 / num2

demo = gr.Interface(
    calculator,
    [
        "number", 
        gr.Radio(["add", "subtract", "multiply", "divide"]),
        "number"
    ],
    "number",
    examples=[
        [5, "add", 3],
        [4, "divide", 2],
        [-4, "multiply", 2.5],
        [0, "subtract", 1.2],
    ],
    title="Toy Calculator",
    description="Here's a sample toy calculator. Allows you to calculate things like $2+2=4$",
)
if __name__ == "__main__":
    demo.launch()

20230917

  • 彻底搬到2号楼,中午做了个极其错误的决定,就是吃完饭回来到底是先睡觉再搬最后一波,还是搬完最后一波到2号楼午睡,我看着自己吃了四碗饭的肚子,想想还是先搞体力活,最后零零碎碎东西太多,啥垃圾桶拖把扫帚的都不好拿,而且没推车,只能纯靠蛮力,去搞了半天终于把床铺好,结果楼上叮叮咚咚也在搬家,吵得不行,完全没补到觉,直接导致5点去参加MBA亚沙16k测试赛跑崩(嘉伟4’12"配速67分钟跑完第一,这对他太轻松了,但我跟到10km胸闷,难受得要命,实在跑不下去只能弃赛),不过确实是很久不跑长距离,体力不支也是情理之中,感觉自己确实是很难追上他们的步伐了。
  • 认识129的花姐,听嘉伟说他第一次去129就被花姐拉爆了,完全看不出来她有如此强的实力(主要是肌肉并不明显,从某种意义上来说,嘉伟看起来比她还要瘦),据说全马能跑进250,令人咋舌,全上海市都是排得上号的,今天测试赛她是第二个跑完的。

gradio@load

import gradio as gr
demo = gr.load("gradio/question-answering", src="spaces")
demo.launch()

20230918~20230919

  • 每日一问,今晚吃什么?
  • 嘉伟晚上七点半拉了个群,《小跑一下》,问AK、宋镇均和我八点半来练不,AK顺便甩出耐克黑马的课表,我仔细一看,800米@2’40"×10,间歇3分钟,新食堂的我看了看碗里的麻辣烫,流下了无能的泪水(你看我吃这么多还有机会吗)?但是我本来还是计划要去慢摇一下的,结果就变成了跟他们一圈带变速跑了,AK和嘉伟硬是把这精英组的强度给撑下来了,我只能每组跟一圈,剩下的都是慢摇(但我不休息,等于是变速跑,但是慢圈2+快圈1,主要下班还有事来不及回去换衣服鞋子,而且今天特别累,热身的时候心肺就已经不太行了),一周没练的宋弟弟跟了5组,吐了3次,但是越吐越猛,确认是宋弟弟本人。(800米2’40",相当于5000米二级运动员达标线的配速,可想而知达标二级是多么困难)
  • 听炳杰说今年有个新生,一来就说要拿校运会5000米冠军,然后王炳杰就向他推荐了嘉伟的名片。不过讲道理今年确实看到不少跑步水平不错的新生,现在又有嘉伟、AK这样的精英选手坐镇,我们上财在长三角高校跑圈也算是有一席之地的。
  • 下半年还是有不少比赛的,但是我都没有参加的计划,大约只会跑一个高校百英里接力赛,希望今年能办得比2021年更好,上马系列赛都不想参加了,累了。

雪球评论爬虫

# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu


import os
import json
import time
import random
import requests

from datetime import datetime
from urllib.parse import urlencode

from src.base import BaseCrawler

class XueqiuCrawler(BaseCrawler):

	def __init__(self):
		super(XueqiuCrawler, self).__init__()

	# 获取股票代码
	def get_stock_code_list(self, type_="sh_sz"):
		# 这个接口能拿到的数据还挺多的,除了常规的股票交易数据外,主要是current_year_percent字段统计本年度累计涨跌幅
		url = "https://stock.xueqiu/v5/stock/screener/quote/list.json?"
		# * 这个数值竟然是刚好顶到5000就不能再多了(检查发现得到的5000个股票代码是不重复的,且都满足正则/S[HZ]\d{6}/),但是沪深A股真的是刚好5000个吗?
		query_dict = {"page"	: 1,
					  "size"	: 5000,
					  "order"	: "desc",
					  "orderby"	: "percent",
					  "order_by": "percent",
					  "market"	: "CN",
					  "type"	: type_, 
					  }
		query_string = urlencode(query_dict)
		headers = """Host: stock.xueqiu
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0
Accept: application/json, text/plain, */*
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Referer: https://xueqiu.com/hq
Origin: https://xueqiu.com
Connection: keep-alive
Cookie: xq_a_token=29bdb37dee2432c294425cc9e8f45710a62643a5; xqat=29bdb37dee2432c294425cc9e8f45710a62643a5; xq_r_token=3a35db27fcf5471898becda7aa5dab6afeafe471; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTY5NjgxMTc5NCwiY3RtIjoxNjk1MDAzNDQyOTMzLCJjaWQiOiJkOWQwbjRBWnVwIn0.pmY02xHxJbZCfAOK9Y1WwfJNJBidKHFmOgE9oxewcazFoEIpCUzN3zp-O3AdXw0uaHvXaLMvw2R7-cae8AGhHOFx-Ioib43ZT7JWQKtNbvqFMEwzrFePNmGth9ODQe6s5BribtIMgv90nqlzSSCsLgDuwlaF8gyNB6iuq-8C4TBj3DBjmblHdjyc9JMGCHKD7t3COuTvvtANV5jw0eh00qB0yeqnQgYH9dT_WE_bDppjK9qjqyhNU05zKeKUzf1QxzSgQ331rafUjpuoCoDgT7eZzlDZoynz8bdGp5eNGv5EuYnET0ITslI3zn1oQQNK8xxyCMDIvO6UJUGFH4fKbA; cookiesu=601695003498916; u=601695003498916; Hm_lvt_1db88642e346389874251b5a1eded6e3=1695003501; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1695004140; device_id=69e58e61ba289c950d70a7957c310a51; s=bs13uu2v2b
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-site
TE: trailers"""
		response = self.easy_requests("GET", url + query_string, headers=BaseCrawler.headers_to_dict(headers))
		json_response = json.loads(response.text)
		stock_code_list = list()
		with open("stock_list_detail.json", 'w', encoding="utf8") as f:
			json.dump(json_response, f, indent=4)
		with open("stock_code.txt", 'w', encoding="utf8") as f:
			for item in json_response["data"]["list"]:
				stock_code = item["symbol"]
				stock_code_list.append(stock_code)
				f.write(stock_code + '\n')
		return stock_code_list
			
	def get_comment(self, stock_code_list=None, save_path="comment.csv"):
		headers = """Host: xueqiu
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0
Accept: */*
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Referer: https://xueqiu.com/S/SH000001
X-Requested-With: XMLHttpRequest
elastic-apm-traceparent: 00-5226eff461396a4694eee43975deb394-09faaefb5ddf8f69-00
Connection: keep-alive
Cookie: acw_tc=2760825d16947703499771600eaceb80e2a64b465eff760e11b8a1dcba60d3; xq_a_token=29bdb37dee2432c294425cc9e8f45710a62643a5; xqat=29bdb37dee2432c294425cc9e8f45710a62643a5; xq_r_token=3a35db27fcf5471898becda7aa5dab6afeafe471; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTY5NjgxMTc5NCwiY3RtIjoxNjk0NzcwMzMzOTg2LCJjaWQiOiJkOWQwbjRBWnVwIn0.A1UOWQnon5rZxK5LGWXSyTwg0jKnNePvvlaiOd1J6YhQj6wJdqNXAWh5YGj3EJ6fC835ep08GXnrKaUoPFqaRV1A_28hJsa3Y0cudjq4ONTXOTlF0juEZbpPICpdILv1byn-MyZbrEA7uO6NK7Ny_TWlhlUOxXaUhJ-BBvFxLceakgP4vw9ttJsyGPSLZ4UzpV0MMLphBgjGw9P0B3HyHXRhQ0z90tSC5j5UclNBg4cveOnlCVTdvGHlycBM_V6YHCJAspwTDJAwOalZ_4BwXCu8YWARwRZ3TxXJNbACOf3mg8k-TxDlcrEwx3u3aCjDRIUIcx5ktcUPzNsYbycFMg; cookiesu=971694770350836; u=971694770350836; device_id=69e58e61ba289c950d70a7957c310a51; Hm_lvt_1db88642e346389874251b5a1eded6e3=1694770353; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1694770812; is_overseas=0
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin"""
		# 这是响应的JSON数据list字段下每个字典的有用字段
		columns = ["user_id", "text", "created_at", "view_count",
				   "like_count", "reply_count", "retweet_count",
				   "reward_count", "reward_user_count", "source",
				   "controversial", "blocked", "blocking", "is_answer",
				   "is_bonus", "is_refused", "is_reward",
				   "is_ss_multi_pic", 
				   ]
		# 这是响应的JSON数据list字段下每个字典的user字段下的有用字段
		user_columns = ["gender", "province", "friends_count",
						"followers_count",
						]
		# 其他字段:记录评论所属的股票,记录
		other_columns = ["symbol", "timestamp"]
		if stock_code_list is None:
			stock_code_list =  ["SH000001", "SZ399001"] + self.get_stock_code_list()
		else:
			stock_code_list =  ["SH000001", "SZ399001"] + stock_code_list
		stock_code_list = list(set(stock_code_list))
		url_stock = "https://xueqiu.com/S/"
		url_api = "https://xueqiu.com/query/v1/symbol/search/status.json?"
		if not os.path.exists(save_path):
			write_string = str()
			for other_column in other_columns:
				write_string += f"{other_column}\t"
			for column in columns:
				write_string += f"{column}\t"
			for user_column in user_columns:
				write_string += f"user.{user_column}\t"
			write_string = write_string.rstrip('\t') + '\n'
			with open(save_path, 'w', encoding="utf8") as f:
				f.write(write_string)
		for i, stock_code in enumerate(stock_code_list):
			print('#' * 64)
			print(i, stock_code)
			query_dict = {"count"		: 50,
						  "comment"		: 0,
						  "symbol"		: stock_code,
						  "hl"			: 0,
						  "source"		: "all",
						  "sort"		: str(),
						  "page"		: 1,
						  "q"			: str(),
						  "type"		: 12,
						  }
			query_string = urlencode(query_dict)
			self.easy_requests("GET", url_stock + stock_code, headers=BaseCrawler.headers_to_dict(headers))
			response = self.easy_requests("GET", url_api + query_string, headers=BaseCrawler.headers_to_dict(headers))
			timestamp = int(time.time())
			print(response.text[:100])
			json_response = json.loads(response.text)
			write_string = str()
			for item in json_response["list"]:
				write_string += f"{stock_code}\t{timestamp}\t"	# 录入other_columns
				for column in columns:
					data = item.get(column)
					data = str(data).replace('\t', ' ')
					write_string += f"{data}\t"
				for user_column in user_columns:
					data = item["user"].get(user_column)
					data = str(data).replace('\t', ' ')
					write_string += f"{data}\t"
				write_string = write_string.rstrip('\t') + '\n'
			with open(save_path, 'a', encoding="utf8") as f:
				f.write(write_string)
			time.sleep(random.randint(30, 60))

20230920~20230921

  • 连日秋雨,这周不知为何明显比上周累,昨天回去都快睁不开眼,早早睡了一波,今天又满血复活,晚上回来直接便服冲了10圈,还是边跟wyl打电话边跑,16分钟出头,完全不累。
  • 嘉伟带宋镇均在同济129厮杀,嘉伟已经完全制霸129,精英组的人也不是他的对手,原本340的变速硬生生被带到320以内,129群里称之为后浪来袭,宋镇均也奇迹般地跟完精英组(我宋哥最擅长跑间歇了.jpg),为了10月的校运会都在发力。王炳杰预计也要回来最后跑一次校运会,他和镇均都大四了,我们四个最后再一起参加一次5000米比赛,给这一期田径队长跑组画上句号。
  • 炳杰大约是要去光华的,以后只能在屏幕里瞻仰我敬爱的王总了。

sys.setrecursionlimit(1500): 设置最大递归深度
sys.getsizeof(var):查看变量占用内存大小(只统计对象本身的浅层大小,不包括其引用的对象)
列表元素连乘:

from functools import reduce

lt = [1,2,3,4,5]
ln = reduce(lambda x, y: x*y, lt)
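
Python 3.8以后也可以直接用math.prod;顺手验证一下getsizeof只统计浅层大小:

import sys
import math

lt = [1, 2, 3, 4, 5]
print(math.prod(lt))          # 120, 与上面的reduce写法等价

nested = [list(range(1000)), list(range(1000))]
print(sys.getsizeof(nested))  # 只有外层列表本身的大小, 不包含两个子列表占用的内存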

快速情感分析:


# 以下导入为补充(情感分析部分假设基于cnsenti库, 词云基于wordcloud库, 如使用其他库请相应调整)
import os
import numpy as np
import matplotlib.pyplot as plt

from PIL import Image
from wordcloud import WordCloud, ImageColorGenerator
from cnsenti import Sentiment, Emotion

class BaseNLP:
	stopwords = ["可以", "没有", "一些", '我们', "对于", "这样", "怎么", 
				 "网上", "希望", "不是", "还是", "就是", "相对", "这个", 
				 "你们", "他们", "不了", "一下", "只能", "自己", "但是", 
				 "什么", '不能', "一个", "不要", "应该", "这些", "很大", 
				 "多多", "越来越", "多点", "不好", "得到", "这些", "或者", 
				 "条件", "方面", "已经", "如果", "最好", "一点", "主要", 
				 "太高", "如果", "比较", "真正", "开通", "不会", "导致", 
				 "需要", "使用", "更好", "时候", "开通", "不会", "一定", 
				 "暂无", "能够", "还要", "反而", "感觉", "觉得", "比如", 
				 "才能", "相关", "无法", "每个", "根本", "一直", "基本", 
				 "之后", "总是", "但是", "为了", "加强", "扩大", "通过", 
				 "减少", "很多", "非常", "加强", "真的", 
				 ]	
	def __init__(self, user_dict_dir=None, stopword_dir="data/stopwords"):
		# 加载自定义的用户字典
		if user_dict_dir is not None:
			for filename in os.listdir(user_dict_dir):
				filepath = os.path.join(user_dict_dir, filename)
				# 这里只拼出路径, 具体加载方式取决于所用分词器(例如jieba.load_userdict)
		# 加载停用词表
		for filename in os.listdir(stopword_dir):
			with open(os.path.join(stopword_dir, filename), 'r', encoding="utf8") as f:
				self.stopwords.extend(f.read().splitlines())
		self.stopwords = set(self.stopwords)

	def easy_emotion(self, text):
		emotion = Emotion()
		return {"emotion": emotion.emotion_count(text)}

	def easy_sentiment(self, text):
		sentiment = Sentiment()
		return {"sentiment": sentiment.sentiment_count(text)}
		
	def plot_wordcloud(self,
					   words,
					   save_path,
					   mask_path=None,
					   font_path="simhei.ttf",
					   background_color="white",
					   max_words=200,
					   ):
		words = list(filter(lambda _word: _word not in self.stopwords and len(_word) > 1, words))	# 去除停用词
		if mask_path is None:
			mask = None
		else:
			mask = np.array(Image.open(mask_path))
			color_func = ImageColorGenerator(mask)
		wordcloud = WordCloud(font_path=font_path,
							  background_color=background_color,
							  height=800,
							  width=800,
							  scale=20,
							  prefer_horizontal=1,
							  mask=mask,
							  max_words=max_words,
							  relative_scaling=.3,
							  max_font_size=80,
							  )
		wordcloud_image = wordcloud.generate(' '.join(words))
		if mask is None:
			plt.imshow(wordcloud, alpha=1)
		else:
			plt.imshow(wordcloud.recolor(color_func=color_func), alpha=1)
		plt.axis("off")
		plt.savefig(save_path, dpi=1200, bbox_inches="tight")
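
一个简单的调用示意(基于上面补充的cnsenti假设,stopword_dir需要指向实际存在的停用词目录):

nlp = BaseNLP(stopword_dir="data/stopwords")
print(nlp.easy_sentiment("这家店的服务态度非常好,下次还会再来"))
print(nlp.easy_emotion("等了一个小时都没人理,真的很生气"))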

20230922~20230925

  • 连日阴雨,周六补觉睡到11点自然醒,昨天等雨停跟AK小跑了5k,状态相对一般,衣服都没换,这学期只去了一次训练,准备等国庆好好练一下。
  • 结果这周一个个全都请假,Jimmy和宝玺都买的周六的票,所以22号结束要到10月7号才见面,更有甚者节后还要休假,感觉隔壁科技组就巨卷,从我来就一直push,连实习生也很push,我们组就很佛系,我觉得这样也挺好。
  • 2号楼的设计居然能让厕所门反锁(保险锁的是把手,而不是单独出一个锁块锁门,所以在里面保险挂上,门一关就反锁了,最搞笑的是宿管也没有厕所门钥匙),我都震惊了,关键我都不知道是谁反锁了门,但我好多东西都在里面,我从楼道窗户试着用棍子去捣,差点没把手机掉里面,无奈早上出门前去问宿管报修,宿管说或许可以踹开,我直接一脸问号???

gradio@mount_gradio_app

from fastapi import FastAPI
import gradio as gr
app = FastAPI()
@app.get("/")
def read_main():
    return {"message": "This is your main app"}
io = gr.Interface(lambda x: "Hello, " + x + "!", "textbox", "textbox")
app = gr.mount_gradio_app(app, io, path="/gradio")
# Then run `uvicorn run:app` from the terminal and navigate to http://localhost:8000/gradio.

20230926~20230928

  • 本周每天回来4k加速跑(便装,@515~@345,均配430)+1k放松,不热身,一般25分钟完事,突出一个高效训练,有10天不穿跑鞋跑步,今天周四又是最后一个工作日,请假提前溜去同济准备陪嘉伟练一下,课表是600米×15,间歇60秒,精英组@340,高级组@355,中级组@415,不知道这么多天养生跑还能不能跟上精英组。
  • 前几天王炳杰买了美津浓新款的无后跟跑鞋(WAVE REBELLION PRO,J1GC231701是标准色,J1GC231702是水墨色),看起来弹性很好,削去后跟既减重,又强制前掌跑,增幅特别大(正式比赛中长跑一般禁用无后跟跑鞋,但是这款属于擦规则的边,正式比赛可能是能用的),而且看鞋底的样子肯定比耐克的耐磨,官网1200,但一般900多就能到手,性价比比耐克高,关键是很漂亮,特别是水墨款贼好看,等炳杰试两周我再看要不要入手。
# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu
# 信心指数计算

import os
import time
import logging
import numpy as np
import pandas as pd
from copy import deepcopy

from src.base import BaseXiaowei
from src.util import timer, initialize_logger, terminate_logger

class Confidence(BaseXiaowei):
	confidence_index_names = ["总体 - 市场需求",
							  "总体 - 营业收入",
							  "总体 - 运营成本",
							  "总体 - 雇员人数",
							  "未工商注册的个体户 - 市场需求",
							  "未工商注册的个体户 - 营业收入",
							  "未工商注册的个体户 - 运营成本",
							  "未工商注册的个体户 - 雇员人数",
							  "工商注册的个体户 - 市场需求",
							  "工商注册的个体户 - 营业收入",
							  "工商注册的个体户 - 运营成本",
							  "工商注册的个体户 - 雇员人数",
							  "公司制企业 - 市场需求",
							  "公司制企业 - 营业收入",
							  "公司制企业 - 运营成本",
							  "公司制企业 - 雇员人数",
							  "北方 - 市场需求",
							  "北方 - 营业收入",
							  "北方 - 运营成本",
							  "北方 - 雇员人数",
							  "南方 - 市场需求",
							  "南方 - 营业收入",
							  "南方 - 运营成本",
							  "南方 - 雇员人数",
							  "农林牧渔业 - 市场需求",
							  "农林牧渔业 - 营业收入",
							  "农林牧渔业 - 运营成本",
							  "农林牧渔业 - 雇员人数",
							  "制造业_建筑业 - 市场需求",
							  "制造业_建筑业 - 营业收入",
							  "制造业_建筑业 - 运营成本",
							  "制造业_建筑业 - 雇员人数",
							  "服务业 - 市场需求",
							  "服务业 - 营业收入",
							  "服务业 - 运营成本",
							  "服务业 - 雇员人数",
							  "消费服务业 - 市场需求",
							  "消费服务业 - 营业收入",
							  "消费服务业 - 运营成本",
							  "消费服务业 - 雇员人数",
							  "商务服务业 - 市场需求",
							  "商务服务业 - 营业收入",
							  "商务服务业 - 运营成本",
							  "商务服务业 - 雇员人数",
							  ]
	def __init__(self, data_path=None, save_path=None, quarters_all=None, data=None, filter_dict=None, debug=False):
		super(Confidence, self).__init__(data_path, save_path, quarters_all, data, filter_dict, debug)

	# 信心指数算法类似, 可以写一个基函数减少代码量
	# @param data					: 计算的数据
	# @param column					: 计算指标的字段名称
	# @param positive_correlation	: 指标数值与信心的相关性(如需求、收入、雇员都是正相关,成本是负相关)
	def _calc_confidence_base(self, data, column, positive_correlation):
		data_increase = data[data[column].str.contains("增长", na=False)]
		increase_index = data_increase[column].groupby(data_increase["quarter"]).count() / data[column].groupby(data["quarter"]).count()
		data_decrease = data[data[column].str.contains("降低", na=False)]
		decrease_index = data_decrease[column].groupby(data_decrease["quarter"]).count() / data[column].groupby(data["quarter"]).count()
		confidence_base = (positive_correlation * (increase_index - decrease_index) / 2 + .5) * 100.
		confidence_base = confidence_base.to_frame().T
		confidence_base_json = {quarter: confidence_base.loc[column, quarter] for quarter in confidence_base.columns}
		return confidence_base_json

	# 市场需求
	def _calc_confidence_demand(self, data):
		return self._calc_confidence_base(data, column="confidence_demand", positive_correlation=1)

	# 营业收入
	def _calc_confidence_revenue(self, data):
		return self._calc_confidence_base(data, column="confidence_revenue", positive_correlation=1)

	# 运营成本
	def _calc_confidence_cost(self, data):
		return self._calc_confidence_base(data, column="confidence_cost", positive_correlation=-1)

	# 雇员人数
	def _calc_confidence_stuff(self, data):
		return self._calc_confidence_base(data, column="confidence_stuff", positive_correlation=1)

	# 合并所有指标并导出到外部文件
	# @param confidence_index: 所有JSON格式的信心指数计算结果按键放好, 形如[{quarter_1: index_1, ...}, ...], 这里每个指标的值一定不会是列表
	# @param quarters: 默认为confidence_index中所有JSON文件键(即季度)的并集, 也可以自行传入需要统计的季度(最好自行传入)
	def save_confidence_index(self, confidence_index, quarters=None):
		if quarters is None:
			quarters = list()
			for data in confidence_index:	# confidence_index is a list of per-metric dicts, not a dict
				quarters.extend(list(data.keys()))
			quarters = list(set(quarters))
		export_dict = {quarter: list() for quarter in quarters}
		# 依次填入每个信心指数
		for quarter in quarters:
			for data in confidence_index:
				export_dict[quarter].append(data.get(quarter, None))					
		# 导出数据
		
		export_data = pd.DataFrame(export_dict, columns=list(export_dict.keys()))
		logging.info(export_data.shape)
		export_data.index = self.confidence_index_names[:]
		export_data.to_csv(self.save_path, header=True, index=True, sep='\t')
		return export_data

	# 信心指数计算主程序
	def run_confidence_index(self):
		confidence_index = list()
		# 目前所有的信心指数都是分数据计算下面四个指标: 市场需求, 营业收入, 运营成本, 雇员人数
		functions = [self._calc_confidence_demand,
					 self._calc_confidence_revenue,
					 self._calc_confidence_cost,
					 self._calc_confidence_stuff,
					 ]
		# 信心指数 - 竖表: 总体(1-4)
		for function in functions:
			confidence_index.append(function(data=self.data))
		# 根据类别分类的统计简易调用函数
		def _easy_calc_by_category(_categories, _column):
			for _category in _categories:
				for function in functions:
					confidence_index.append(function(data=self.data[self.data[_column] == _category]))
		register_types = ["未工商注册的个体户", "工商注册的个体户", "公司制企业_工商注册的企业"]
		south_or_norths = ["北方", "南方"]
		industries = ["农林牧渔业", "制造业_建筑业", "服务业"]
		industry_3_categories = ["消费服务业", "商务服务业"]
		_easy_calc_by_category(_categories=register_types, _column="register_type")					# 信心指数 - 竖表: 未工商注册的个体户(5-8), 工商注册的个体户(9-12), 公司制企业(13-16)
		_easy_calc_by_category(_categories=south_or_norths, _column="south_or_north")				# 信心指数 - 竖表: 北方(17-20), 南方(21-24)
		_easy_calc_by_category(_categories=industries, _column="industry")							# 信心指数 - 行业 - 行业: 农林牧渔业(25-28), 制造业_建筑业(29-32), 服务业(33-36)
		_easy_calc_by_category(_categories=industry_3_categories, _column="industry_category_3")	# 信心指数 - 行业 - 第三产业分类: 消费服务业(37-40), 商务服务业(41-44)
		logging.info(f"Save to {self.save_path} ...")
		self.save_confidence_index(confidence_index, quarters=self.quarters_all)
		logging.info("Done!")

20230929~20231002

  • The four days leading into National Day were exhausting. I caught up on sleep on the 29th, then met up with AK in the evening and joined him for one last session before he leaves on the 30th for Inner Mongolia and the 12th Asia-Pacific Business School Desert Challenge. He is honestly not in great shape: work already wears him out, and over the past month he has probably trained seriously only about ten times. Desert trail racing is nothing like ordinary forest trail running, since the load, the weather, and the resupply are all far harsher. In the photos he posted from right after a previous race he looks weather-beaten, nothing like a 26-year-old.
  • On the 30th I had dinner with two junior schoolmates, then came back for strength work: 8 sets of 30 walking lunges (carrying a 15 kg plate), 200 double-unders, and an easy 10 laps with 胡鑫宇. On the morning of the 1st I felt nothing yet, but after the afternoon nap I was sore all over, and by the 2nd it hurt so much I could barely walk. My parents were visiting Shanghai, though, so I had to take them on a long loop today, about 30,000 steps; by the end I could hardly lift my legs. Even so, back at school in the evening I still put in 10 laps of progressive running, the pace building from 5'50" to 3'40" with an average of 4'33". Lately the watch's estimated maximal oxygen uptake ( $\text{VO}_2\text{max}$ ) has climbed back to its previous best of 59 and my resting heart rate is back near 30 bpm, so I feel close to peak form again.
  • This week's 129 workout is 9 × 1000 m with 90-second rest: elite group at 3'30" pace, advanced at 3'45", intermediate at 4'00", beginner at 4'25". That is absurd: last week's 15 × 600 m (60-second rest) had the elite group at only 3'40" and the advanced group at 4'00". It must be because last week's outrageous elite-group pacer pushed things down to 3'25", so the coach decided to give them a challenge. This week there is no way I could hang with even the advanced group (for reference, at my previous peak our team's 1000 m intervals also averaged about 3'45", but only 6 reps with 5-minute rests, and even that was brutal); even the intermediate group will be a struggle. October is race-prep month, and there will be a great many races from October through December, though none of them involve me.

The Great Firewall has been especially aggressive lately. Besides huggingface (the most annoying case: there are domestic substitutes such as modelscope and paddle, but they are clearly not in the same league, and although a VPN works, downloading models burns through far too much traffic for those of us on cheap plans where 1 TB a year is nowhere near enough), the API docs of many libraries are blocked as well, e.g. https://www.gradio.app/docs/interface. No idea what gxb is up to this time.

import gradio as gr
def image_classifier(inp):
    return {'cat': 0.3, 'dog': 0.7}
demo = gr.Interface(fn=image_classifier, inputs="image", outputs="label",
                    flagging_callback=gr.CSVLogger())  # CSVLogger lives in the gradio namespace

import gradio as gr
hf_writer = gr.HuggingFaceDatasetSaver(HF_API_TOKEN, "image-classification-mistakes")  # HF_API_TOKEN is a placeholder from the gradio docs: substitute your own token
def image_classifier(inp):
    return {'cat': 0.3, 'dog': 0.7}
demo = gr.Interface(fn=image_classifier, inputs="image", outputs="label",
                    allow_flagging="manual", flagging_callback=hf_writer)

20231003~20231005

  • AK won the Super A division of the desert trail race, and thanks to that one title SUFE's team made the overall top ten. 宽哥 is still a beast: winning a championship while working 996, a role model for the rest of us.
  • Rested hard for two days, then put in some intensity in the evening: 4 km all-out at 3'54" pace + 2 km aerobic transition at 4'15" + 2 km of surges (3 fast laps at 3'50" + 2 easy laps at 4'50"). Just passable, but it suggests I can take a shot at 19'30" in the 5k race. 炳杰 is on a business trip this month, so he will most likely miss it.
# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu

import gc
import torch
from src.test.base import BaseHuggingFaceTest
from src.tool.huggingface_tool import *


class RobertaLargeFinetunedRaceTest(BaseHuggingFaceTest):
	# https://huggingface.co/LIAMF-USP/roberta-large-finetuned-race
	Tokenizer = RobertaTokenizer
	Model = RobertaForMultipleChoice
	def __init__(self, model_path='LIAMF-USP/roberta-large-finetuned-race', device='cpu'):
		super(RobertaLargeFinetunedRaceTest, self).__init__(model_path, device)

	# @param data			: Dict[article(List[Str]), question(List[Str]), options(List[List[Str]])]
	# @return batch_logits	: FloatTensor(batch_size, 4)
	# @return batch_predicts: List[Str] (batchsize, )
	def run(self, data, max_length=512):
		return run_roberta_large_finetuned_race(data, self.tokenizer, self.model, max_length)


class LongformerLarge4096AnsweringRaceTest(BaseHuggingFaceTest):
	# https://huggingface.co/potsawee/longformer-large-4096-answering-race
	Tokenizer = LongformerTokenizer
	Model = LongformerForMultipleChoice	
	def __init__(self, model_path='potsawee/longformer-large-4096-answering-race', device='cpu'):
		super(LongformerLarge4096AnsweringRaceTest, self).__init__(model_path, device)

	# @param data			: Dict[article(List[Str]), question(List[Str]), options(List[List[Str]])]
	# @return batch_logits	: FloatTensor(batch_size, 4)
	# @return batch_predicts: List[Str] (batchsize, )
	def run(self, data, max_length=4096):
		return run_longformer_large_4096_answering_race(data, self.tokenizer, self.model, max_length)	


class RobertaBaseSquad2Test(BaseHuggingFaceTest):
	# https://huggingface.co/deepset/roberta-base-squad2
	Tokenizer = AutoTokenizer
	Model = AutoModelForQuestionAnswering	
	def __init__(self, model_path='deepset/roberta-base-squad2', device='cpu'):	# default path corrected to match the squad2 model above
		super(RobertaBaseSquad2Test, self).__init__(model_path, device)
		
	# @param data			: Dict[article(List[Str]), question(List[Str])]
	# @return batch_results	: List[Str] (batchsize, )
	def run(self, data, max_length=4096):
		return run_roberta_base_squad2(data, self.tokenizer, self.model, max_length)


class ChatGLM6BTest(BaseHuggingFaceTest):
	# https://huggingface.co/THUDM/chatglm-6b
	# https://huggingface.co/THUDM/chatglm-6b-int4
	# https://huggingface.co/THUDM/chatglm-6b-int4-qe
	# https://huggingface.co/THUDM/chatglm-6b-int8
	# Note: The series of chatglm-6b-xxx models cannot run on CPU
	# You can quantize with `model = model.quantize(4)` or `model = model.quantize(8)` for low GPU memory
	def __init__(self, model_path='THUDM/chatglm-6b', device='cuda'):
		super(ChatGLM6BTest, self).__init__(model_path, device)

	def load_tokenizer(self):
		self.tokenizer = AutoTokenizer.from_pretrained(self.model_path, trust_remote_code=True)

	def load_model(self):
		self.model = AutoModel.from_pretrained(self.model_path, trust_remote_code=True).half().to(self.device)

	# @param data		: Dict[content(Str)]
	# @return response	: Robot response
	# @return history	: Chat history
	def run(self, data, history=list()):
		return run_chatglm_6b(data, self.tokenizer, self.model, history)

	# @param content: Str
	def request(self, content):
		response, history = self.run(data={'content': content}, history=list())
		return response


class VisualGLM6BTest(BaseHuggingFaceTest):
	# https://huggingface.co/THUDM/visualglm-6b
	# Note: visualglm-6b likewise cannot run on CPU
	# You can quantize with `model = model.quantize(4)` or `model = model.quantize(8)` for low GPU memory
	def __init__(self, model_path='THUDM/visualglm-6b', device='cuda'):
		super(VisualGLM6BTest, self).__init__(model_path, device)

	def load_tokenizer(self):
		self.tokenizer = AutoTokenizer.from_pretrained(self.model_path, trust_remote_code=True)

	def load_model(self):
		self.model = AutoModel.from_pretrained(self.model_path, trust_remote_code=True).half().to(self.device)

	# @param data		: Dict[content(Str)]
	# @return response	: Robot response
	# @return history	: Chat history
	def run(self, data, history=list()):
		return run_visualglm_6b(data, self.tokenizer, self.model, history)
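
For reference, here is a standalone sketch of what the multiple-choice inference behind run_roberta_large_finetuned_race presumably looks like; the actual helper lives in src.tool.huggingface_tool and is not shown, so treat this as an illustration of the documented input/output format rather than the repo's real code.

import torch
from transformers import RobertaTokenizer, RobertaForMultipleChoice

tokenizer = RobertaTokenizer.from_pretrained("LIAMF-USP/roberta-large-finetuned-race")
model = RobertaForMultipleChoice.from_pretrained("LIAMF-USP/roberta-large-finetuned-race").eval()

# Toy input following the documented format: Dict[article(List[Str]), question(List[Str]), options(List[List[Str]])]
data = {
    "article": ["The quick brown fox jumps over the lazy dog."],
    "question": ["What does the fox jump over?"],
    "options": [["A cat", "A lazy dog", "A fence", "A river"]],
}

batch_logits = list()
for article, question, options in zip(data["article"], data["question"], data["options"]):
    # Encode one (context, question + option) pair per option -> tensors of shape (1, n_options, seq_len)
    encoded = tokenizer([article] * len(options),
                        [f"{question} {option}" for option in options],
                        padding=True, truncation=True, max_length=512, return_tensors="pt")
    inputs = {key: tensor.unsqueeze(0) for key, tensor in encoded.items()}
    with torch.no_grad():
        batch_logits.append(model(**inputs).logits)  # (1, n_options)
batch_logits = torch.cat(batch_logits, dim=0)        # FloatTensor(batch_size, 4)
batch_predicts = ["ABCD"[index] for index in batch_logits.argmax(dim=-1).tolist()]  # List[Str]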

20231006~20231008

  • Eight days flew by, seven of them a struggle. This weekend 二工大 hosts a Unirun event: a half-marathon relay, 7 runners covering 20 km (at least two must be women), plus a 10k individual race whose winner gets a direct entry to the Shanghai Marathon (it has declined a bit; there used to be a 3000 RMB prize as well, which AK won once with a time just over 35 minutes). The venue is right next to 交大, though, so their fast guys will probably show up this year, and even 嘉伟 would find it hard to take the win, but everyone is keen to snag a Shanghai Marathon slot for fun this year. If the younger teammates cannot fill a relay squad, I will go run the half-marathon relay with them.
  • But the atmosphere on the team feels off lately, maybe a generation gap with the post-05 kids; even 嘉伟 rarely comes to practice. And this semester lty has seemed unlike herself, getting too close to 东哥. Granted, what 廖是深 did earlier was graceless, but that kind of talk rarely comes out of nowhere, and 东哥 has a family after all.

20231009~20231010

  • Since 宝玺姐 came back we have lunch together, which means ordering delivery again. But ordering the Huangshan rice-and-vegetables place for delivery now feels like a ripoff: in the shop it is 19 RMB with unlimited refills of rice and side dishes (their pickled cucumber, pickled peppers, and boiled shredded tofu are all excellent), while delivery is also 19 RMB, burns a pile of my coupons, and allows no refills at all. So I have ordered from every shop nearby, and not a single one is edible: the day before yesterday 拌酱, yesterday 蜀地源 (nowhere near the 邯郸路 branch), today 鸡公煲. There is simply no reliable place around here. Once 芳姐 leaves for Beijing the day after tomorrow, I plan to take 宝玺姐 to eat at the Huangshan shop itself; a shop with that much conscience around here deserves to be packed.
  • Two races this weekend. On Saturday 怡宝 is running a 5k event on campus: first place gets a direct marathon entry, 2nd-4th get fitness bands, 5th-10th get jump ropes, and the top ten all get a case of 怡宝 sports drink, so I am going to stock up and run a case of drinks back for the team. On Sunday it is the 10k individual race at 二工大; the relay is all first-year kids going for fun, and I would only have signed up if we had fielded a strong team to chase a placing. 卢星雨 is also going along to keep the kids company, so I will run the 10k on the road with 嘉伟 as a workout and drag along 宋镇均, who has been terribly slack lately. In fact, since 9.28 the only day I have not run was the day before yesterday; every other day I have done at least 4 km. I am dead tired but can hold on, and a good sleep on the weekend mostly brings me back. A 10k road race as a warm-up before the school sports meet is a fine thing anyway.
# -*- coding: utf-8 -*-
# @author: caoyang
# @email: caoyang@163.sufe.edu


import json
import time
import random
import requests


from datetime import datetime
from urllib.parse import urlencode

from src.base import BaseCrawler

class XueqiuCrawler(BaseCrawler):

	def __init__(self):
		super(XueqiuCrawler, self).__init__()

	# 获取股票代码
	def get_stock_code_list(self, type_="sh_sz"):
		# 这个接口能拿到的数据还挺多的,除了常规的股票交易数据外,主要是current_year_percent字段统计本年度累计涨跌幅
		url = "https://stock.xueqiu/v5/stock/screener/quote/list.json?"
		# * 这个数值竟然是刚好顶到5000就不能再多了(检查发现得到的5000个股票代码是不重复的,且都满足正则/S[HZ]\d{6}/),但是沪深A股真的是刚好5000个吗?
		query_dict = {"page"	: 1,
					  "size"	: 5000,
					  "order"	: "desc",
					  "orderby"	: "percent",
					  "order_by": "percent",
					  "market"	: "CN",
					  "type"	: type_, 
					  }
		query_string = urlencode(query_dict)
		headers = """Host: stock.xueqiu
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0
Accept: application/json, text/plain, */*
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Referer: https://xueqiu.com/hq
Origin: https://xueqiu.com
Connection: keep-alive
Cookie: xq_a_token=29bdb37dee2432c294425cc9e8f45710a62643a5; xqat=29bdb37dee2432c294425cc9e8f45710a62643a5; xq_r_token=3a35db27fcf5471898becda7aa5dab6afeafe471; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTY5NjgxMTc5NCwiY3RtIjoxNjk1MDAzNDQyOTMzLCJjaWQiOiJkOWQwbjRBWnVwIn0.pmY02xHxJbZCfAOK9Y1WwfJNJBidKHFmOgE9oxewcazFoEIpCUzN3zp-O3AdXw0uaHvXaLMvw2R7-cae8AGhHOFx-Ioib43ZT7JWQKtNbvqFMEwzrFePNmGth9ODQe6s5BribtIMgv90nqlzSSCsLgDuwlaF8gyNB6iuq-8C4TBj3DBjmblHdjyc9JMGCHKD7t3COuTvvtANV5jw0eh00qB0yeqnQgYH9dT_WE_bDppjK9qjqyhNU05zKeKUzf1QxzSgQ331rafUjpuoCoDgT7eZzlDZoynz8bdGp5eNGv5EuYnET0ITslI3zn1oQQNK8xxyCMDIvO6UJUGFH4fKbA; cookiesu=601695003498916; u=601695003498916; Hm_lvt_1db88642e346389874251b5a1eded6e3=1695003501; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1695004140; device_id=69e58e61ba289c950d70a7957c310a51; s=bs13uu2v2b
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-site
TE: trailers"""
		response = self.easy_requests("GET", url + query_string, headers=BaseCrawler.headers_to_dict(headers))
		json_response = json.loads(response.text)
		stock_code_list = list()
		with open("stock_list_detail.json", 'w', encoding="utf8") as f:
			json.dump(json_response, f, indent=4)
		with open("stock_code.txt", 'w', encoding="utf8") as f:
			for item in json_response["data"]["list"]:
				stock_code = item["symbol"]
				stock_code_list.append(stock_code)
				f.write(stock_code + '\n')
		return stock_code_list
			
	def get_comment(self, stock_code_list=None, save_path="comment.csv"):
		headers = """Host: xueqiu
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0
Accept: */*
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Referer: https://xueqiu.com/S/SH000001
X-Requested-With: XMLHttpRequest
elastic-apm-traceparent: 00-5226eff461396a4694eee43975deb394-09faaefb5ddf8f69-00
Connection: keep-alive
Cookie: acw_tc=2760825d16947703499771600eaceb80e2a64b465eff760e11b8a1dcba60d3; xq_a_token=29bdb37dee2432c294425cc9e8f45710a62643a5; xqat=29bdb37dee2432c294425cc9e8f45710a62643a5; xq_r_token=3a35db27fcf5471898becda7aa5dab6afeafe471; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTY5NjgxMTc5NCwiY3RtIjoxNjk0NzcwMzMzOTg2LCJjaWQiOiJkOWQwbjRBWnVwIn0.A1UOWQnon5rZxK5LGWXSyTwg0jKnNePvvlaiOd1J6YhQj6wJdqNXAWh5YGj3EJ6fC835ep08GXnrKaUoPFqaRV1A_28hJsa3Y0cudjq4ONTXOTlF0juEZbpPICpdILv1byn-MyZbrEA7uO6NK7Ny_TWlhlUOxXaUhJ-BBvFxLceakgP4vw9ttJsyGPSLZ4UzpV0MMLphBgjGw9P0B3HyHXRhQ0z90tSC5j5UclNBg4cveOnlCVTdvGHlycBM_V6YHCJAspwTDJAwOalZ_4BwXCu8YWARwRZ3TxXJNbACOf3mg8k-TxDlcrEwx3u3aCjDRIUIcx5ktcUPzNsYbycFMg; cookiesu=971694770350836; u=971694770350836; device_id=69e58e61ba289c950d70a7957c310a51; Hm_lvt_1db88642e346389874251b5a1eded6e3=1694770353; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1694770812; is_overseas=0
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin"""
		# 这是响应的JSON数据list字段下每个字典的有用字段
		columns = ["user_id", "text", "created_at", "view_count",
				   "like_count", "reply_count", "retweet_count",
				   "reward_count", "reward_user_count", "source",
				   "controversial", "blocked", "blocking", "is_answer",
				   "is_bonus", "is_refused", "is_reward",
				   "is_ss_multi_pic", 
				   ]
		# 这是响应的JSON数据list字段下每个字典的user字段下的有用字段
		user_columns = ["gender", "province", "friends_count",
						"followers_count",
						]
		# 其他字段:记录评论所属的股票,记录
		other_columns = ["symbol", "timestamp"]
		if stock_code_list is None:
			stock_code_list =  ["SH000001", "SZ399001"] + self.get_stock_code_list()
		url_stock = "https://xueqiu/S/"
		url_api = "https://xueqiu/query/v1/symbol/search/status.json?"
		write_string = str()
		for column in other_columns:
			write_string += f"{column}\t"
		for column in columns:
			write_string += f"{column}\t"
		for user_column in user_columns:
			write_string += f"user.{user_column}\t"
		write_string = write_string.rstrip('\t') + '\n'
		with open(save_path, 'w', encoding="utf8") as f:
			f.write(write_string)
			
		for i, stock_code in enumerate(stock_code_list):
			print('#' * 64)
			print(i, stock_code)
			query_dict = {"count"		: 50,
						  "comment"		: 0,
						  "symbol"		: stock_code,
						  "hl"			: 0,
						  "source"		: "all",
						  "sort"		: str(),
						  "page"		: 1,
						  "q"			: str(),
						  "type"		: 12,
						  }
			query_string = urlencode(query_dict)
			self.easy_requests("GET", url_stock + stock_code, headers=BaseCrawler.headers_to_dict(headers))
			response = self.easy_requests("GET", url_api + query_string, headers=BaseCrawler.headers_to_dict(headers))
			timestamp = int(time.time())
			print(response.text[:100])
			json_response = json.loads(response.text)
			write_string = str()
			for item in json_response["list"]:
				write_string += f"{stock_code}\t{timestamp}\t"	# 录入other_columns
				for column in columns:
					write_string += f"{item.get(column)}\t"
				for user_column in user_columns:
					write_string += f"user.{item['user'].get(column)}\t"
				write_string = write_string.rstrip('\t') + '\n'
			with open(save_path, 'a', encoding="utf8") as f:
				f.write(write_string)
			time.sleep(random.randint(30, 60))
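
BaseCrawler.headers_to_dict is defined elsewhere in the repo and not shown here; judging from how it is called on the raw header blocks above, a reconstruction along the following lines would do the job (my sketch, not the repo's actual implementation):

def headers_to_dict(headers: str) -> dict:
    # Parse a browser-copied "Key: Value" block (one header per line) into a dict
    # usable by requests; split only on the first colon so that values which contain
    # colons themselves (URLs, cookies) stay intact.
    header_dict = dict()
    for line in headers.splitlines():
        line = line.strip()
        if line:
            key, _, value = line.partition(":")
            header_dict[key.strip()] = value.strip()
    return header_dict

# e.g. requests.get(url + query_string, headers=headers_to_dict(headers))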

20231011~20231012

  • Worked a bit of overtime last night, got to the track at eight and found 耿益平 (a pacer for the 129 elite group) and AK jogging easily at 5'30" pace. I floated four laps with them, then went all out for 5000 m on my own at 4'07" pace. This seven-day work week has been a headache (mostly because I keep dragging bedtime out to one or two in the morning and am badly short on sleep; last night I had no energy left back at the dorm and had to turn the lights off before midnight), but with the weekend races coming I still have to add volume. AK says 耿益平 is also going to apply for our MBA, so the SUFE running club gains another elite.
  • For Saturday's 怡宝 event, besides the prizes for the top ten, the top two men and top two women also get tickets to Kipchoge's China tour, so the turnout is huge, probably two or three hundred people at least. But the campus is only so big and can hardly hold a race that size; 东哥 says he has arranged for the 10 track-team members to start at the very front, and in the end the placings will all come from within the team anyway, so I most likely still have a good shot at top two. 王炳杰 is even taking leave to come back and run it, purely to stock up on prizes.
  • Besides, 李乐康 most likely will not show up at the school sports meet at the end of the month, 宋镇均 has been slack lately, and 王炳杰 will be away on business, so this time I have a real shot at the podium. (嘉伟: "I am merely looking around to see who will finish second.")

A demo of the skill-classification prediction:

Honestly, the task turns out to be highly subjective: the categories of many skills are far too ambiguous, and the whole thing is a real pain to do.


The word count has grown too large and the editor is lagging badly, so I will put down the pen here.

(END)
