
序言

last, but not the last

终于也走到了这一步。


文章目录

  • 序言
  • 20240820
  • 20240821~20240823
  • 20240824
  • 20240825
  • 20240826
  • 20240827
  • 20240828
  • 20240829
  • 20240830
  • 20240831
  • 20240901
  • 20240902
  • 20240903
  • 20240904
  • 20240905
  • 20240906
  • 20240907
  • 20240908
  • 20240909
  • 20240910
  • 20240911
  • 20240912
  • 20240913
  • 20240914
  • 20240915(完篇)


20240820

  • 始终没下下来的雨,是夏末秋凉前最后的沉郁。

  • 各种手续尘埃落定,距离出行也不到一个月了,还是有所期待的。

  • 八点半,简单跑几组小强度,试了一下新买的半弹(是真没裤子穿,说到底还是半弹舒服,但是感觉长距离散热不是很好),收尾跟胡哥和AX一起摇了4K多,慢跑还是得有人一起才能很轻松,AX之前有6组×800米间歇@345,天气还是太热了,并不适合上太大强度。

  • PS:整体来看,伤后的三个月夏训,又慢慢回归到保守的后跟跑模式,前掌跑有时候反而感觉垂直振幅很大,心率容易上得太猛,经济性反而不行。目前八月总跑量130km,均配4’23",相对前两个月更加保守,不过近期确实也没有太认真训练。

accelerate DPO

accelerate launch --config-file examples/accelerate_configs/deepspeed_zero3.yaml examples/research_projects/stack_llama_2/scripts/dpo_llama2.py \
    --model_name_or_path="sft/final_checkpoint" \
    --output_dir="dpo"

DPO原理回顾(如何计算loss)

  • 输入定义:

    • $\pi_\theta$:策略模型
    • $\pi_{ref}$:参考模型
    • $D = \{(x_i, y_i^+, y_i^-)\}$:训练数据,其中 $x_i$ 是输入提示,$y_i^+$ 是偏好的回答,$y_i^-$ 是非偏好的回答
  • 对数概率计算,对于每个样本 $(x_i, y_i^+, y_i^-)$:

    • $\pi_\theta(y_i^+ \mid x_i) = \log P_\theta(y_i^+ \mid x_i)$
    • $\pi_\theta(y_i^- \mid x_i) = \log P_\theta(y_i^- \mid x_i)$
    • $\pi_{ref}(y_i^+ \mid x_i) = \log P_{ref}(y_i^+ \mid x_i)$
    • $\pi_{ref}(y_i^- \mid x_i) = \log P_{ref}(y_i^- \mid x_i)$
  • 对数概率比计算

    • $r_i^+ = \pi_\theta(y_i^+ \mid x_i) - \pi_{ref}(y_i^+ \mid x_i)$
    • $r_i^- = \pi_\theta(y_i^- \mid x_i) - \pi_{ref}(y_i^- \mid x_i)$
  • DPO 损失计算(以 sigmoid 损失为例)
    $$L_i = -\log\big(\sigma(\beta \cdot (r_i^+ - r_i^-))\big) \cdot (1 - \lambda) - \log\big(1 - \sigma(\beta \cdot (r_i^+ - r_i^-))\big) \cdot \lambda$$
    其中:

    • $\sigma$ 是 sigmoid 函数
    • $\beta$ 是温度参数
    • $\lambda$ 是标签平滑参数
  • 总体损失
    $$L = \frac{1}{N} \sum_{i=1}^N L_i$$

其中 $N$ 是批次大小。

  • 优化目标:$\theta^* = \arg\min_\theta L$

  • 奖励计算(用于评估)

    • $\text{chosen\_reward}_i = \beta \cdot (\pi_\theta(y_i^+ \mid x_i) - \pi_{ref}(y_i^+ \mid x_i))$
    • $\text{rejected\_reward}_i = \beta \cdot (\pi_\theta(y_i^- \mid x_i) - \pi_{ref}(y_i^- \mid x_i))$
  • 评估指标

    • 平均 chosen 奖励:$\frac{1}{N} \sum_{i=1}^N \text{chosen\_reward}_i$
    • 平均 rejected 奖励:$\frac{1}{N} \sum_{i=1}^N \text{rejected\_reward}_i$
    • 奖励准确率:$\frac{1}{N} \sum_{i=1}^N \mathbb{1}[\text{chosen\_reward}_i > \text{rejected\_reward}_i]$
    • 奖励边际:$\frac{1}{N} \sum_{i=1}^N (\text{chosen\_reward}_i - \text{rejected\_reward}_i)$
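
把上面的公式直接翻译成一段 PyTorch 伪实现(示意代码,并非 TRL 源码;输入假设是各条样本在 chosen/rejected 回答上的对数概率之和):

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps,
             beta=0.1, label_smoothing=0.0):
    # 对数概率比 r_i^+ 与 r_i^-
    chosen_ratio = policy_chosen_logps - reference_chosen_logps
    rejected_ratio = policy_rejected_logps - reference_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)

    # sigmoid 形式的 DPO 损失,lambda 即 label_smoothing
    # 注意 -log(1 - sigmoid(x)) = -logsigmoid(-x)
    losses = (-F.logsigmoid(logits) * (1 - label_smoothing)
              - F.logsigmoid(-logits) * label_smoothing)

    # 用于评估的奖励
    chosen_rewards = beta * chosen_ratio.detach()
    rejected_rewards = beta * rejected_ratio.detach()
    return losses.mean(), chosen_rewards, rejected_rewards

# 随机数据走一遍:打印总体损失与奖励准确率
logps = [torch.randn(4) for _ in range(4)]
loss, cr, rr = dpo_loss(*logps)
print(loss, (cr > rr).float().mean())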

DPODataCollatorWithPadding & training

data_collator = DPODataCollatorWithPadding(
    # pad token 的 id(例如 2)
    pad_token_id=self.tokenizer.pad_token_id,
    # label 的 padding id,默认 -100,不参与 loss 计算
    label_pad_token_id=args.label_pad_token_id,
    # 是否 encoder-decoder 结构,这里为 False
    is_encoder_decoder=self.is_encoder_decoder,
)
  • DPO DataCollator class that pads the tokenized inputs to the maximum length of the batch.
    • prompt_input_ids, chosen_input_ids, rejected_input_ids
    • chosen_labels, rejected_labels
    • prompt_attention_mask, chosen_attention_mask, rejected_attention_mask
  • concatenated_input_ids, concatenated_attention_mask
    • input_ids: (prompt + chosen), labels: chosen
    • input_ids: (prompt + rejected), labels: rejected
    • 一个三元组的数据(问题+好回答+坏回答)可以分为两条记录,一条是prompt+chosen,另一条是prompt+rejected,见下面的示意
    • Loss只定义在回答上,而不会在prompt上计算loss
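在送入模型之前,一个偏好三元组大致会被展开成下面两条样本(纯示意,token id 为假设值,字段名与 padding 细节以 TRL 的实际实现为准):

# 一个 (prompt, chosen, rejected) 三元组 => 两条样本
prompt_ids   = [1, 2, 3]
chosen_ids   = [4, 5]
rejected_ids = [6]

chosen_input_ids   = prompt_ids + chosen_ids      # prompt + 好回答
rejected_input_ids = prompt_ids + rejected_ids    # prompt + 坏回答

# label 在 prompt 部分填 -100,loss 只会落在回答 token 上
chosen_labels   = [-100] * len(prompt_ids) + chosen_ids
rejected_labels = [-100] * len(prompt_ids) + rejected_ids

回到 TRL 里 concatenated forward 的源码片段: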
outputs = model(
    concatenated_batch["concatenated_input_ids"],
    attention_mask=concatenated_batch["concatenated_attention_mask"],
    use_cache=False,
    **model_kwargs,
)

all_logits = outputs.logits

...

all_logps, size_completion = self.get_batch_logps(
    all_logits,
    concatenated_batch["concatenated_labels"],
    # average_log_prob=self.loss_type == "ipo",
    is_encoder_decoder=self.is_encoder_decoder,
    label_pad_token_id=self.label_pad_token_id,
)

...

labels = concatenated_batch["concatenated_labels"].clone()
nll_loss = cross_entropy_loss(all_logits[:len_chosen], labels[:len_chosen])

if self.loss_type == "ipo":
    all_logps = all_logps / size_completion

chosen_logps = all_logps[:len_chosen]
rejected_logps = all_logps[len_chosen:]

chosen_logits = all_logits[:len_chosen]
rejected_logits = all_logits[len_chosen:]
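
其中 get_batch_logps 做的事情大致如下(简化示意,并非 TRL 源码:按 -100 的 label mask 聚合每个 token 的对数概率):

import torch

def get_batch_logps(logits, labels, label_pad_token_id=-100):
    # logits: (B, L, V), labels: (B, L);自回归预测的是下一个 token,所以错一位对齐
    labels = labels[:, 1:].clone()
    logits = logits[:, :-1, :]
    loss_mask = labels != label_pad_token_id

    # 被 mask 的位置先填一个合法 id,gather 之后再乘 mask 置零
    labels = labels.masked_fill(~loss_mask, 0)
    per_token_logps = torch.gather(
        logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)
    ).squeeze(2)

    # 返回每条样本在回答 token 上的对数概率之和,以及回答长度(ipo 会用它做平均)
    return (per_token_logps * loss_mask).sum(-1), loss_mask.sum(-1)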

小寄巧

  • 小批量数据集,快速测试和调试
if sanity_check:
    dataset = dataset.select(range(min(len(dataset), 1000)))

20240821~20240823

  • 似乎有所影响,但确实跟我关系不大。

  • wyl最近不在学校,导致烂活的各种手续就特别麻烦,每次亦童问我怎么办,我都只能告诉他最后总有办法的,所以寄希望于能顺利落地吧。

  • 最近是有些偷懒,不过也是生活所迫,但整个八月也就才两天没跑而已,其中一天还是力量日。前天下雨,昨天开始操场就已经不让进了;昨晚陪胡哥环校绕了半个小时,有人一起是要轻松一些,他今天要回扬州;今晚则是放纵日,吃太撑,本来应该是去做力量,然而大爷死活不给我开门,5分开外跑了不到1k肚子就受不了,不过休两天也挺好,反正也都是能说得过去的。

判断当前函数名称

import sys
import inspect
 
def current_function_name():
    return inspect.currentframe().f_code.co_name

def my_function():
	current_frame = sys._getframe()
	caller_frame = current_frame.f_back
	return caller_frame.f_code.co_name
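
一个简单的用法对比(注意:current_function_name() 封装成函数后取到的是它自己的名字,想取“当前函数”的名字要么内联写,要么像 my_function 那样用 f_back 取调用方):

def demo():
    print(inspect.currentframe().f_code.co_name)  # demo:内联写法,取的是当前函数名
    print(current_function_name())                # current_function_name:封装后取到的是工具函数自身
    print(my_function())                          # demo:f_back 取到的是调用方的名字

demo()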

使用send方法,实现与生成器函数的交互:

def xie():          # 方法xie()代表消费者模型(负责接收并处理任务)
    print('等待接收处理任务。。。')
    while True:         # 每次循环等待生产者发来的一个任务
        data = (yield )
        print('收到任务:', data)

def producer():             # 方法producer()代表生产者模型(负责发送任务)
    c = xie()               # 创建生成器对象,此时函数体尚未开始执行
    c.send(None)            # 预激生成器,使其运行到第一个 yield 处
    for i in range(3):
        print('发送一个任务。。', '任务%d' % i)
        c.send('任务%d' % i)

if __name__ == "__main__":
    producer()

这里有个问题,就是如何重置一个生成器对象。在 torch 里面,DataLoader 的迭代器内部有一个 _reset() 方法(而且每次重新调用 iter(dataloader) 都会得到一个新的迭代器,相当于重置),但如果要重置一个自己写的生成器,感觉没有什么好办法:

import torch
from torch.utils.data import DataLoader

# 创建原始的dataloader(MyDataset 仅为示意)
dataset = MyDataset()
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# 在需要“重置”的时候,重新调用 iter() 得到一个全新的迭代器即可
data_iter = iter(dataloader)

也可以自定义重置操作:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, data):
        self.data = data
        self.reset()

    def reset(self):
        # 在重置dataloader时进行的操作,例如重新加载数据等
        self.data = load_data()

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)

# 创建原始的dataloader
dataset = MyDataset(data)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# 在需要重置的时候,调用自定义数据集类的reset方法
dataset.reset()

查了一下源码,torch.utils.data.DataLoader 的迭代器(_BaseDataLoaderIter)的 _reset 是这么写的:

class _BaseDataLoaderIter(object):
	# ...
    def _reset(self, loader, first_iter=False):
        self._sampler_iter = iter(self._index_sampler)
        self._num_yielded = 0
        self._IterableDataset_len_called = loader._IterableDataset_len_called

class _MultiProcessingDataLoaderIter(_BaseDataLoaderIter):
    r"""Iterates once over the DataLoader's dataset, as specified by the sampler"""
    # ...
    def _reset(self, loader, first_iter=False):
        super()._reset(loader, first_iter)
        self._send_idx = 0  # idx of the next task to be sent to workers
        self._rcvd_idx = 0  # idx of the next task to be returned in __next__
        # information about data not yet yielded, i.e., tasks w/ indices in range [rcvd_idx, send_idx).
        # map: task idx => - (worker_id,)        if data isn't fetched (outstanding)
        #                  \ (worker_id, data)   if data is already fetched (out-of-order)
        self._task_info = {}
        self._tasks_outstanding = 0  # always equal to count(v for v in task_info.values() if len(v) == 1)
        # A list of booleans representing whether each worker still has work to
        # do, i.e., not having exhausted its iterable dataset object. It always
        # contains all `True`s if not using an iterable-style dataset
        # (i.e., if kind != Iterable).
        # Not that this indicates that a worker still has work to do *for this epoch*.
        # It does not mean that a worker is dead. In case of `_persistent_workers`,
        # the worker will be reset to available in the next epoch.
        self._workers_status = [True for i in range(self._num_workers)]
        # We resume the prefetching in case it was enabled
        if not first_iter:
            for idx in range(self._num_workers):
                self._index_queues[idx].put(_utils.worker._ResumeIteration())
            resume_iteration_cnt = self._num_workers
            while resume_iteration_cnt > 0:
                return_idx, return_data = self._get_data()
                if isinstance(return_idx, _utils.worker._ResumeIteration):
                    assert return_data is None
                    resume_iteration_cnt -= 1
        # prime the prefetch loop
        for _ in range(self._prefetch_factor * self._num_workers):
            self._try_put_index()

其实关键就是第一行 self._sampler_iter = iter(self._index_sampler):重新对 sampler 取一遍迭代器,相当于从头再迭代一次;后面那些是多进程(num_workers)场景下恢复预取的处理。本质上还是重新构造了一遍迭代器,没有什么技巧,所以如果拿不到生成器的构造方式(工厂函数),确实没有什么好办法,见下面的示意。
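
如果手上只有生成器对象本身,确实没办法原地重置;一个常见的折中是把生成器的工厂函数(和参数)保存下来,需要重置时重新构造一个新的生成器,下面是一个简单示意:

class ResettableGenerator:
    """保存生成器的工厂函数,reset() 时重新构造一个新的生成器(示意写法)。"""
    def __init__(self, factory, *args, **kwargs):
        self.factory = factory
        self.args = args
        self.kwargs = kwargs
        self.reset()

    def reset(self):
        self._gen = self.factory(*self.args, **self.kwargs)

    def __iter__(self):
        return self._gen

    def __next__(self):
        return next(self._gen)


def count_up(n):
    yield from range(n)

g = ResettableGenerator(count_up, 3)
print(list(g))    # [0, 1, 2]
g.reset()
print(next(g))    # 0,重置后从头开始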


20240824

  • 亦童想投cvpr,但被wyl劝阻,虽然wyl明面上说是因为cvpr不在目录里,但是我觉得可能意思是cvpr太高,不太够得到,但是亦童这次做出的效果确实惊艳的,如果没有作弊搞一些trick的话。

  • 陪AK在世纪公园干了4圈,20km@4’17",质量尚可,主要是这个夏训没有一个说得过去的长距离,我也很想跑一个,但一个人很难坚持下来,最近嘉伟伤病休整,其余的人都懒得不行,不必多言。

  • 前3圈一起跑,第4圈AK示意可提速先行。我体感良好,试着加到4分以内,但很快岔气,右腹疼痛难耐,17km补水点休了半分钟(世纪公园一圈有两处直饮水的水龙头,确实不错,对于夏训的长距离还是很关键的,只需要几秒钟就可以补完水,不影响节奏),AK很快追上并反超,我示意自己需要调整一下,让他先走一步,到19km又重新追上了他,并顶完最后一段。整体不算太吃力,除了15~16k这段提速,心率基本没有超过170bpm。

张量并行与 megatron-lm 及 accelerate 配置

https://arxiv.org/abs/1909.08053

  • megatron-lm: 顾名思义针对 transformer 来做的优化
    • 是 mp(论文题目),其实更多是tp(Tensor 张量内部做split)
    • Transformer(intra layer parallel)
      • mlp
      • mha
      • embedding (input: wte, output: lm_head)
  • 单卡做基线,没有通信的开销。存在划分,必然就存在通信。
  • 集成进 accelerate
    • accelerate 的几个 backends
      • deepspeed
      • fsdp
      • megatron-lm
    • https://huggingface.co/docs/accelerate/usage_guides/megatron_lm

mlp

$$Y=\text{GeLU}(X_{(b\ell),k}A_{k,k'})\in \mathbb R^{(b\ell),k'}$$

对于矩阵 A 的分块方式

  • 行分块
    • $X=\begin{bmatrix}X_1,X_2\end{bmatrix},\quad A=\begin{bmatrix}A_1\\A_2\end{bmatrix}$
    • $Y=\text{GeLU}(XA)=\text{GeLU}(X_1A_1+X_2A_2)$
    • 有两点
      • GeLU 的非线性导致 $\text{GeLU}(X_1A_1+X_2A_2)\neq \text{GeLU}(X_1A_1)+\text{GeLU}(X_2A_2)$
      • $X_iA_i\in\mathbb R^{(b\ell),k'}$
  • 列分块
    • $A=\begin{bmatrix}A_1,A_2\end{bmatrix}$
    • $Y=\text{GeLU}(XA)=\text{GeLU}(X\begin{bmatrix}A_1,A_2\end{bmatrix})=[\text{GeLU}(XA_1),\text{GeLU}(XA_2)]$
      • $XA_i\in \mathbb R^{b\ell,k'/2}$
    • 如果不同的 splits 放在不同的卡上,每张卡都需要维护完整的输入 $X$(数据未进行分块)

$$Z=\text{GeLU}(XA)B$$

对于矩阵 B 自然进行行分块:

  • $B=\begin{bmatrix}B_1\\B_2\end{bmatrix}$

$$\begin{split} Z&=\text{GeLU}(XA)B\\ &=\left[\text{GeLU}(XA_1),\text{GeLU}(XA_2)\right]\begin{bmatrix}B_1\\B_2\end{bmatrix}\\ &=\text{GeLU}(XA_1)B_1 + \text{GeLU}(XA_2)B_2 \end{split}$$

  • 最后对两张卡计算结果的加和是一种 all-reduce 的过程
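
可以用随机张量在单卡上模拟一下这个拆分(A 按列切、B 按行切),验证两路结果求和(相当于 all-reduce)与不拆分的整体计算一致:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(4, 8)            # (b*l, k)
A = torch.randn(8, 16)           # (k, k')
B = torch.randn(16, 8)           # (k', k)

A1, A2 = A.chunk(2, dim=1)       # A 按列切,两张“卡”各持一半
B1, B2 = B.chunk(2, dim=0)       # B 按行切

Z_full = F.gelu(X @ A) @ B                               # 不拆分的参考结果
Z_tp = F.gelu(X @ A1) @ B1 + F.gelu(X @ A2) @ B2         # 两路各自计算后求和(all-reduce)

print(torch.allclose(Z_full, Z_tp, atol=1e-5))           # True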

mha

  • 多头自注意力按照 num heads($h$)对 Q,K,V 三个 projection matrix 按列拆分($(k,k)\rightarrow (k,k/h)$)
    • 对于输出投影矩阵 $O$:按行拆分
  • 每个头的输出为 $Y_i=\text{softmax}\left(\frac{(XQ_i)(XK_i)^T}{\sqrt{d_k}}\right)V_i\in \mathbb R^{\ell,k/h}$

$$[Y_1,Y_2]\begin{bmatrix}B_1\\B_2\end{bmatrix}=Y_1B_1+Y_2B_2$$

emb

  • 如果词表数量是 64000,嵌入表示维度为 5120,采用 32 位精度浮点数,那么这一层参数需要的显存大约为 64000 × 5120 × 4 / 1024 / 1024 ≈ 1250MB,反向梯度同样需要 1250MB,仅仅存储就需要将近 2.5GB。
    • [personal chatgpt] Llama2 7B vs. Llama3 8B(词表、attention 及 mlp)
  • wte: $E_{H\times v}=[E_1,E_2]$
    • column-wise(v,vocab-size dimension)
    • 1-50000: 1-25000, 25001-50000
    • all-reduce (weight/tensor sum)
  • lm head: $[Y_1,Y_2]=[XE_1,XE_2]$
    • all-gather: (weight/tensor concat)
      • 存在通信的问题:$(b\times s)\times v$($v$ 是万级别的)
    • softmax:logits => probs
    • $XE_i\in\mathbb R^{(b\times s)\times\frac v2}$
    • $\text{rowsum}(\exp(XE_1))$ 是长度为 $bs$ 的列向量,$\text{rowsum}(\exp(XE_2))$ 同理;两个列向量 all-reduce 相加后仍是长度为 $bs$ 的列向量

  • [0, 1, 25000, 25001]: input,不进行拆分
    • 索引 E1 => 4*hidden_size,第3-4行为全0;
    • 索引 E2 => 4*hidden_size,第1-2行为全0;
    • 两个结果通过 all-reduce 加一起;
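下面先用一个小例子模拟词表并行的查表过程(词表按两半切,不在本分片范围的 token 置零后 all-reduce 求和;词表大小取 8 只是示意);随后是原文验证 lm_head 按列拆分后 softmax 仍然一致的代码:

import torch

torch.manual_seed(0)
vocab, hidden = 8, 4
E = torch.randn(vocab, hidden)
E1, E2 = E[:vocab // 2], E[vocab // 2:]        # 词表维度一分为二,模拟两张卡各持一半

input_ids = torch.tensor([0, 1, 5, 6])          # 前两个 id 属于分片1,后两个属于分片2

def shard_lookup(ids, shard, lo, hi):
    mask = (ids >= lo) & (ids < hi)
    out = torch.zeros(len(ids), shard.size(1))
    out[mask] = shard[ids[mask] - lo]           # 不在本分片范围的行保持全 0
    return out

# 两张“卡”各自查表,再 all-reduce(求和)得到完整的 embedding
merged = shard_lookup(input_ids, E1, 0, 4) + shard_lookup(input_ids, E2, 4, 8)
print(torch.allclose(merged, E[input_ids]))     # True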
import torch
import torch.nn.functional as F

torch.manual_seed(42)

A = torch.randn(5, 8)  # 5行8列的随机矩阵
"""
tensor([[ 1.9269,  1.4873,  0.9007, -2.1055,  0.6784, -1.2345, -0.0431, -1.6047],
        [-0.7521,  1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688,  0.7624],
        [ 1.6423, -0.1596, -0.4974,  0.4396, -0.7581,  1.0783,  0.8008,  1.6806],
        [ 0.0349,  0.3211,  1.5736, -0.8455,  1.3123,  0.6872, -1.0892, -0.3553],
        [-1.4181,  0.8963,  0.0499,  2.2667,  1.1790, -0.4345, -1.3864, -1.2862]])
"""

A_1, A_2 = A.split(4, dim=1)

A_1
"""
tensor([[ 1.9269,  1.4873,  0.9007, -2.1055],
        [-0.7521,  1.6487, -0.3925, -1.4036],
        [ 1.6423, -0.1596, -0.4974,  0.4396],
        [ 0.0349,  0.3211,  1.5736, -0.8455],
        [-1.4181,  0.8963,  0.0499,  2.2667]])
"""

A_2
"""
tensor([[ 0.6784, -1.2345, -0.0431, -1.6047],
        [-0.7279, -0.5594, -0.7688,  0.7624],
        [-0.7581,  1.0783,  0.8008,  1.6806],
        [ 1.3123,  0.6872, -1.0892, -0.3553],
        [ 1.1790, -0.4345, -1.3864, -1.2862]])
"""

exp_A_1 = torch.exp(A_1)
exp_A_2 = torch.exp(A_2)

rowsum_exp_A_1 = torch.sum(exp_A_1, dim=1)
rowsum_exp_A_2 = torch.sum(exp_A_2, dim=1)

# all-reduce
rowsum = rowsum_exp_A_1 + rowsum_exp_A_2

rowsum.view(-1, 1)
"""
tensor([[17.2970],
        [10.2543],
        [19.1843],
        [14.4078],
        [17.8164]])
"""

exp_A_1 / rowsum.view(-1, 1)
"""
tensor([[0.3971, 0.2558, 0.1423, 0.0070],
        [0.0460, 0.5071, 0.0659, 0.0240],
        [0.2693, 0.0444, 0.0317, 0.0809],
        [0.0719, 0.0957, 0.3348, 0.0298],
        [0.0136, 0.1375, 0.0590, 0.5415]])
"""

exp_A_2 / rowsum.view(-1, 1)
"""
tensor([[0.1139, 0.0168, 0.0554, 0.0116],
        [0.0471, 0.0557, 0.0452, 0.2090],
        [0.0244, 0.1532, 0.1161, 0.2799],
        [0.2578, 0.1380, 0.0234, 0.0487],
        [0.1825, 0.0363, 0.0140, 0.0155]])
"""

softmax = F.softmax(A, dim=1)  # 直接对完整矩阵做 softmax,作为对照
torch.concat([exp_A_1 / rowsum.view(-1, 1), exp_A_2 / rowsum.view(-1, 1)], dim=1)
torch.allclose(softmax, torch.concat([exp_A_1 / rowsum.view(-1, 1), exp_A_2 / rowsum.view(-1, 1)], dim=1)) # True

20240825

  • 回血日,炖鸡汤和红烧青鱼段。身体无碍,昨天的长距离没有给身体带来不适感,腿不酸脚上也没有磨皮,反而很好地改善了状态,得益于一段时间里的低强度过渡,不过,偶尔上点量也是极好的。

  • 目前手表给出的各项成绩预测已经完全回升到3月的水平,自我感觉可能要更好一些,至少我觉得自己肯定可以破三,但是手表就从不这么认为。

  • AK就比较疯狂了,今天马不停蹄开往虞山大环,20km@1200米爬升,速度倒是不快,因为天气比较热,中途肯定有所休息。他最近三天每天都是20km的量,9月比赛陆续开始,他是真有点急了。就昨天的情况来看,最近两个多月缺乏有效训练对他的影响还是很大的,实际水平现在可能还真不如我。

  • 晚上九点半下去简单遛了4K多环校,月跑量补到160km,平均配速4’24",跑一回少一回,且行且珍惜咯。

昨天最后一段代码里的 torch.allclose 就是验证按列拆分后再合并的计算结果与直接整体计算基本一致。

关于 all-reduce 可参考 https://zhuanlan.zhihu.com/p/469942194,本质上是一个优化节点间数据通信的算法,实现并不复杂,工业界也有现成的集合通信库,例如阿里巴巴的 ACCL。

accelerate megatron-lm config

https://huggingface.co/docs/accelerate/usage_guides/megatron_lm

  • Sequence Parallelism (SP): Reduces memory footprint without any additional communication.
    • https://arxiv.org/pdf/2205.05198
      • (Megatron 3)
    • Only applicable when using TP.
    • It reduces activation memory required as it prevents the same copies to be on the tensor parallel ranks post all-reduce by replacing them with reduce-scatter and no-op operation would be replaced by all-gather.
    • https://zhuanlan.zhihu.com/p/522198082
    • LayerNorm和Dropout的计算被平摊到了各个设备上,减少了计算资源的浪费;
    • LayerNorm和Dropout所产生的激活值也被平摊到了各个设备上,进一步降低了显存开销。

存在划分,必然就存在通信。在 Megatron 1、2 中,Transformer 核的 TP 通信由正向两个 All-reduce 以及反向两个 All-reduce 组成。Megatron 3 由于对 sequence 维度也进行了划分,All-reduce 在这里已经不再合适:为了收集各个设备上 sequence parallel 所产生的结果,需要插入 All-gather 算子;而为了使 TP 所产生的结果能传入 sequence parallel 层,需要插入 reduce-scatter 算子。原文图中用一对共轭算子表示这两种操作:一个是前向 All-gather、反向 reduce-scatter,另一个则恰好相反。这么一来,Megatron 3 中一共有 4 个 All-gather 和 4 个 reduce-scatter 算子。乍一看通信操作比 Megatron 1、2 多得多,但其实不然:一般而言,一个 All-reduce 就相当于 1 个 reduce-scatter 加 1 个 All-gather,所以总通信量是一样的。
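
可以用几份普通张量模拟一下“一个 All-reduce 等价于一个 reduce-scatter 加一个 All-gather”这件事(单机模拟,不涉及 torch.distributed):

import torch

torch.manual_seed(0)
world_size = 4
ranks = [torch.randn(8) for _ in range(world_size)]   # 每个“rank”持有一份同形状的张量

# all-reduce:所有 rank 的张量直接求和,人人拿到完整结果
allreduce = torch.stack(ranks).sum(0)

# reduce-scatter:求和结果按 rank 切块,每个 rank 只拿到属于自己的那一块
scattered = [torch.stack([r.chunk(world_size)[i] for r in ranks]).sum(0)
             for i in range(world_size)]
# all-gather:再把各 rank 手里的块拼回完整张量
gathered = torch.cat(scattered)

print(torch.allclose(allreduce, gathered))  # True:两条路径得到相同结果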

如何配置?

使用 accelerate config 命令进行交互式配置,生成的配置会写入 ~/.cache/huggingface/accelerate/default_config.yaml(也可以直接修改该文件),之后再用 accelerate launch 启动训练。

关于LangGraph构建复杂的RAG工作流,先开个头:

4 LangGraph 构建复杂 RAG workflow(Self-corrective)

LangChain => LangGraph

  • LangChain 的链(Chain)不具备“循环”能力;
  • AgentExecutor 调度的Agent运行过于“黑盒”。
    • llm with tool executor

LangGraph vs. AutoGen

  • 都是 Multi-agent framework
  • LangGraph prefers an approach where you explicitly define different agents and transition probabilities, preferring to represent it as a graph
  • Autogen frames it more as a “conversation”.
  • Another key difference between Autogen and LangGraph is that LangGraph is fully integrated into the LangChain ecosystem, meaning you take fully advantage of all the LangChain integrations and LangSmith observability.

(Self-Corrective) RAG on LangGraph

https://github.com/vbarda/pandas-rag-langgraph/blob/main/demo.ipynb

  • RAG(Retrieval-Augmented Generation)
    • 未被 llm 训练过程覆盖的 domain knowledge 或者新知识;
    • 提供确定性的知识作为 context,进一步降低幻觉(hallucinations)
      • GROUNDED IN DOCUMENTS
  • vector database
    • Chroma

20240826

  • 绿叶咕佬肉加醋,极度差评。以及两个月都没修好的新食堂,特么只剩一周就开学了,真是过分。新食堂,没有你我可怎么活,新食堂!

  • 晚上薄底鞋慢跑7k,操场一堆人踢球打球,但就是不让别人进去。只得在学校外面绕了一圈4k多,状态似乎非常好,接近4分的配速顶完也不是很累,回来又补了3k,也没必要尽全力,状态好的时候还是得压一压。

  • PS:主要还是希望8月最后一周也能把跑量凑到200k(六七两月都达到200k,目前8月还差33k,剩余5天),虽然没啥意义,但无论最终在哪里破,能不能破,总归要给自己一个交代,如是。

RAG chain: developer-defined control flow

RAG Agent: LLM-defined control flow

Self-Corrective RAG

导入必要的包:

  • 使用的tool是TavilySearchResults,需要先创建 Tavily 的 API key(TAVILY_API_KEY),用法参考 https://python.langchain.com/v0.2/docs/integrations/tools/tavily_search/
# !pip install langgraph-checkpoint-sqlite
# !pip install beautifulsoup4
# !pip install chromadb

import re
from typing import Annotated, Iterator, Literal, TypedDict

from langchain import hub

# llm
# from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI

# tool
# https://python.langchain.com/v0.2/docs/integrations/tools/tavily_search/
# TAVILY_API_KEY
from langchain_community.tools.tavily_search import TavilySearchResults

# rag
from langchain_community.document_loaders import web_base
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_core.retrievers import BaseRetriever

# messages & prompts
from langchain_core.messages import BaseMessage, AIMessage, convert_to_messages
from langchain_core.prompts import ChatPromptTemplate

from langchain_core.output_parsers import StrOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_text_splitters import RecursiveCharacterTextSplitter

# langgraph
from langgraph.graph import END, StateGraph, add_messages

from langgraph.checkpoint.memory import MemorySaver

import os
os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'
from dotenv import load_dotenv
load_dotenv()

model, retriever & tools

SOURCE_URLS = [
    'https://pandas.pydata.org/docs/user_guide/indexing.html',
    'https://pandas.pydata.org/docs/user_guide/groupby.html',
    'https://pandas.pydata.org/docs/user_guide/merging.html'
]
NEWLINE_RE = re.compile("\n+")

class PandasDocsLoader(web_base.WebBaseLoader):
    def lazy_load(self) -> Iterator[Document]:
        """Lazy load text from the url(s) in web_path."""
        for path in self.web_paths:
            soup = self._scrape(path, bs_kwargs=self.bs_kwargs)
            text = soup.get_text(**self.bs_get_text_kwargs)
            text = NEWLINE_RE.sub("\n", text)     
            metadata = web_base._build_metadata(soup, path)
            yield Document(page_content=text, metadata=metadata)
PandasDocsLoader(SOURCE_URLS).web_paths
"""
['https://pandas.pydata.org/docs/user_guide/indexing.html',
 'https://pandas.pydata.org/docs/user_guide/groupby.html',
 'https://pandas.pydata.org/docs/user_guide/merging.html']
"""
def prepare_documents(urls: list[str]) -> list[Document]:
    text_splitter = RecursiveCharacterTextSplitter(
        separators=[
            r"In \[[0-9]+\]",
            r"\n+",
            r"\s+"
        ],
        is_separator_regex=True,
        chunk_size=1000
    )
    docs = [PandasDocsLoader(url).load() for url in urls]
    docs_list = [item for sublist in docs for item in sublist]
    return text_splitter.split_documents(docs_list)

def get_retriever() -> BaseRetriever:
    documents = prepare_documents(SOURCE_URLS)
    vectorstore = Chroma.from_documents(
        documents=documents,
        collection_name="pandas-rag-chroma",
        embedding=OpenAIEmbeddings(),
    )
    retriever = vectorstore.as_retriever()
    return retriever

llm = ChatOpenAI(model="gpt-4o", temperature=0)
retriever = get_retriever()
tavily_search_tool = TavilySearchResults(max_results=3)

这样我们就把文档和search tools都定义好了


20240827

  • 湿闷的上午,简直是难受到了极点,直到一场畅快的大雨,兴得几分清凉。

  • 晚饭跟亦童探讨了一下细节,之前以为他是改了模型输入,把重建的比例信息以数值形式输进去了,其实最终还是转化为了MASK输入,当然这个是需要手动操作的把box框出来,一个改进是使用分割模型,这样只需要点一下即可,而且效果会更好(box框更精确)。

  • 晚上赶着雨停慢跑会儿,不想中途又是大雨倾盆,最终10km@4’23",平均心率161bpm,感觉状态很好,体感比较轻松,但不带补给一个人很难再坚持更久,不勉强再顶更多,点到为止即可。

Graph

  • state: graph 中所有 node 的输入
    • question: user query
    • messages: add(即role和content,动态追加)
    • documents: 基于 retriever.invoke(question) or search_tool(search出来一些网站URL,添加到问题)
    • candidate_answer: generate(是否真正回答用户问题,且没有幻觉)
    • retries
    • web_fallback
  • nodes: 接收状态,执行动作,产生/改变状态
    • rewrite question: 单独的一个重写用户 query 的 llm 调用
    • document_search: retriever
      • append documents
    • generate: llm chain (lcel)
      • 提供或者替换 candidate_answer
    • web search: search tool
      • append documents
    • finalize response

state

class GraphState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    question: str
    documents: list[Document]
    candidate_answer: str
    retries: int
    web_fallback: bool


class GraphConfig(TypedDict):
    max_retries: int

MAX_RETRIES = 3
VERBOSE = True

nodes

定义LangGraph里的各个节点(函数工具)

document search node

def document_search(state: GraphState):
    """
    Retrieve documents

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, documents, that contains retrieved documents
    """
    if VERBOSE:
        print("---RETRIEVE---")

    question = convert_to_messages(state["messages"])[-1].content

    # Retrieval
    documents = retriever.invoke(question)
    return {"documents": documents, "question": question, "web_fallback": True} # 这个webfallback=True应该就是说如果不知道答案,就返回网络结果

generate node

def generate(state: GraphState):
    """
    Generate answer

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): New key added to state, generation, that contains LLM generation
    """
    if VERBOSE:
        print("---GENERATE---")
    question = state["question"]
    documents = state["documents"]
    retries = state["retries"] if state.get("retries") is not None else -1

    # lcel
    rag_chain = RAG_PROMPT | llm | StrOutputParser() # lcel语法从左到右
    generation = rag_chain.invoke({"context": documents, "question": question})
    return {"retries": retries + 1, "candidate_answer": generation}

rewrite question

现在一般是要把用户问题先重写,也是让llm自己重写

QUERY_REWRITER_SYSTEM = (
"""
You are a question re-writer that converts an input question to a better version that is optimized for vectorstore retrieval.
Look at the input and try to reason about the underlying semantic intent / meaning.
"""
)

QUERY_REWRITER_PROMPT = ChatPromptTemplate.from_messages(
    [
        ("system", QUERY_REWRITER_SYSTEM),
        (
            "human",
            "Here is the initial question: \n\n {question} \n Formulate an improved question.",
        ),
    ]
)

def transform_query(state: GraphState):
    """
    Transform the query to produce a better question.

    Args:
        state (dict): The current graph state

    Returns:
        state (dict): Updates question key with a re-phrased question
    """
    if VERBOSE:
        print("---TRANSFORM QUERY---")

    question = state["question"]

    # Re-write question
    query_rewriter = QUERY_REWRITER_PROMPT | llm | StrOutputParser()
    better_question = query_rewriter.invoke({"question": question})
    return {"question": better_question}

20240828

  • 老天指定是坏掉了,八月底跟黄梅天似的,明摆着要下雨,但就是硬憋着不肯下。

  • 晚上九点环校3圈,9k@4’36"。保守且平稳,但没有午睡,有点缺觉,感觉跑得很疲累,不想凑到10k了,反正月跑量差不多也够数。LZR跟LXY也在跑,后半程看到LY也在跑,感觉LZR比较勤快,XR跟YY两人还是太懒,指望他俩还不如指望AK重回巅峰。

  • 今晚源深看起来人还不少,AK要跑12组×800米的间歇,但是我不太想过去(因为感觉会下雨),天天环校属实无聊,还是场地赤膊刚强度来得痛快(趁AK病,赶紧拿捏一下他)。发现一位13级金融的校友孙大伟,应该是顶了10组,很不错,他上半年两场半马都是1小时26分台,但目测水平并不在我之下,真的很强,感觉下半年有好几个人都要冲击破三。

web search

def web_search(state: GraphState):
    if VERBOSE:
        print("---RUNNING WEB SEARCH---")

    question = state["question"]
    documents = state["documents"]
    search_results = tavily_search_tool.invoke(question)
    search_content = "\n".join([d["content"] for d in search_results])
    documents.append(Document(page_content=search_content, metadata={"source": "websearch"}))
    return {"documents": documents, "web_fallback": False}

finalize response

def finalize_response(state: GraphState):
    if VERBOSE:
        print("---FINALIZING THE RESPONSE---")

    return {"messages": [AIMessage(content=state["candidate_answer"])]}

edges & graph

Grade answer

  • Check hallucinations
  • Check answer relevance

如何判定模型回答是否存在幻觉?

class GradeHallucinations(BaseModel):
    """Binary score for hallucination present in generation answer."""

    binary_score: str = Field(
        description="Answer is grounded in the facts, 'yes' or 'no'"
    )


HALLUCINATION_GRADER_SYSTEM = (
"""
You are a grader assessing whether an LLM generation is grounded in / supported by a set of retrieved facts.
Give a binary score 'yes' or 'no', where 'yes' means that the answer is grounded in / supported by the set of facts.

IF the generation includes code examples, make sure those examples are FULLY present in the set of facts, otherwise always return score 'no'.
"""
)

HALLUCINATION_GRADER_PROMPT = ChatPromptTemplate.from_messages(
    [
        ("system", HALLUCINATION_GRADER_SYSTEM),
        ("human", "Set of facts: \n\n {documents} \n\n LLM generation: {generation}"),
    ]
)


class GradeAnswer(BaseModel):
    """Binary score to assess answer addresses question."""

    binary_score: str = Field(
        description="Answer addresses the question, 'yes' or 'no'"
    )


ANSWER_GRADER_SYSTEM = (
"""
You are a grader assessing whether an answer addresses / resolves a question.
Give a binary score 'yes' or 'no', where 'yes' means that the answer resolves the question.
"""
)

ANSWER_GRADER_PROMPT = ChatPromptTemplate.from_messages(
    [
        ("system", ANSWER_GRADER_SYSTEM),
        ("human", "User question: \n\n {question} \n\n LLM generation: {generation}"),
    ]
)

def grade_generation_v_documents_and_question(state: GraphState, config) -> Literal["generate", "transform_query", "web_search", "finalize_response"]:
    """
    Determines whether the generation is grounded in the document and answers question.

    Args:
        state (dict): The current graph state

    Returns:
        str: Decision for next node to call
    """
    question = state["question"]
    documents = state["documents"]
    generation = state["candidate_answer"]
    web_fallback = state["web_fallback"]
    retries = state["retries"] if state.get("retries") is not None else -1
    max_retries = config.get("configurable", {}).get("max_retries", MAX_RETRIES)

    # this means we've already gone through web fallback and can return to the user
    if not web_fallback:
        return "finalize_response"

    if VERBOSE:
        print("---CHECK HALLUCINATIONS---")

    # llm lcel chain
    hallucination_grader = HALLUCINATION_GRADER_PROMPT | llm.with_structured_output(GradeHallucinations)
    hallucination_grade: GradeHallucinations = hallucination_grader.invoke(
        {"documents": documents, "generation": generation}
    )

    # Check hallucination
    if hallucination_grade.binary_score == "no":
        if VERBOSE: print("---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---")
        return "generate" if retries < max_retries else "web_search"

    if VERBOSE:
        print("---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---")
        print("---GRADE GENERATION vs QUESTION---")

    # Check question-answering
    answer_grader = ANSWER_GRADER_PROMPT | llm.with_structured_output(GradeAnswer)
    answer_grade: GradeAnswer = answer_grader.invoke({"question": question, "generation": generation})
    if answer_grade.binary_score == "yes":
        if VERBOSE: print("---DECISION: GENERATION ADDRESSES QUESTION---")
        return "finalize_response"
    else:
        if VERBOSE: print("---DECISION: GENERATION DOES NOT ADDRESS QUESTION---")
        return "transform_query" if retries < max_retries else "web_search"

上面的关键就是 retries < max_retries:达到最大重试次数后就转而调用搜索引擎。这是一个从 generate 节点出发的条件跳转,先做幻觉检测(llm lcel chain invoke),再做答案相关性检测,可能的条件分支总结如下:

  • “generate”,
  • “transform_query”,
  • “web_search”,
  • “finalize_response”

元素已齐,接下来我们正式构建图:

workflow = StateGraph(GraphState, config_schema=GraphConfig)

# Define the nodes
workflow.add_node("document_search", document_search)
workflow.add_node("generate", generate)
workflow.add_node("transform_query", transform_query)
workflow.add_node("web_search", web_search)
workflow.add_node("finalize_response", finalize_response)

# Build graph
workflow.set_entry_point("document_search")
workflow.add_edge("document_search", "generate")
workflow.add_edge("transform_query", "document_search")
workflow.add_edge("web_search", "generate")
workflow.add_edge("finalize_response", END)

workflow.add_conditional_edges(
    "generate",
    grade_generation_v_documents_and_question
)

# Compile
graph = workflow.compile()

使用如下命令把构建好的图绘制出来:

from IPython.display import Image, display
# grade_generation_v_documents_and_question
display(Image(graph.get_graph().draw_mermaid_png())) # mermaid流程图


然后跑图:

inputs就是初始state

VERBOSE = True
inputs = {"messages": [("human", "how do i calculate sum by groups")]}
for output in graph.stream(inputs):
    print(output)
    print("\n---\n")

输出结果:

---RETRIEVE---
{'document_search': {'question': 'how do i calculate sum by groups', 'documents': [Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/groupby.html', 'title': 'Group by: split-apply-combine — pandas 2.2.2 documentation'}, page_content='Windowing operations\nTime series / date functionality\nTime deltas\nOptions and settings\nEnhancing performance\nScaling to large datasets\nSparse data structures\nFrequently Asked Questions (FAQ)\nCookbook\nUser Guide\nGroup by:...\nGroup by: split-apply-combine#\nBy “group by” we are referring to a process involving one or more of the following\nsteps:\nSplitting the data into groups based on some criteria.\nApplying a function to each group independently.\nCombining the results into a data structure.\nOut of these, the split step is the most straightforward. In the apply step, we\nmight wish to do one of the following:\nAggregation: compute a summary statistic (or statistics) for each\ngroup. Some examples:\nCompute group sums or means.\nCompute group sizes / counts.\nTransformation: perform some group-specific computations and return a\nlike-indexed object. Some examples:\nStandardize data (zscore) within a group.\nFilling NAs within groups with a value derived from each group.'), Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/groupby.html', 'title': 'Group by: split-apply-combine — pandas 2.2.2 documentation'}, page_content='In [108]: grouped["C"].agg(["sum", "sum"])\nOut[108]: \n          sum       sum\nA                      \nbar  0.392940  0.392940\nfoo -1.796421 -1.796421\npandas also allows you to provide multiple lambdas. In this case, pandas\nwill mangle the name of the (nameless) lambda functions, appending _<i>\nto each subsequent lambda.'), Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/groupby.html', 'title': 'Group by: split-apply-combine — pandas 2.2.2 documentation'}, page_content='In [116]: grouped.agg({"C": "sum", "D": "std"})\nOut[116]: \n            C         D\nA                      \nbar  0.392940  1.366330\nfoo -1.796421  0.884785\nTransformation#\nA transformation is a GroupBy operation whose result is indexed the same\nas the one being grouped. 
Common examples include cumsum() and\ndiff().\nIn [117]: speeds\nOut[117]: \n          class           order  max_speed\nfalcon     bird   Falconiformes      389.0\nparrot     bird  Psittaciformes       24.0\nlion     mammal       Carnivora       80.2\nmonkey   mammal        Primates        NaN\nleopard  mammal       Carnivora       58.0\nIn [118]: grouped = speeds.groupby("class")["max_speed"]\nIn [119]: grouped.cumsum()\nOut[119]: \nfalcon     389.0\nparrot     413.0\nlion        80.2\nmonkey       NaN\nleopard    138.2\nName: max_speed, dtype: float64'), Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/groupby.html', 'title': 'Group by: split-apply-combine — pandas 2.2.2 documentation'}, page_content='In [73]: grouped[["A", "B"]].sum()\nOut[73]: \n                   A                  B\nA                                      \nbar        barbarbar        onethreetwo\nfoo  foofoofoofoofoo  onetwotwoonethree\nIterating through groups#\nWith the GroupBy object in hand, iterating through the grouped data is very\nnatural and functions similarly to itertools.groupby():\nIn [74]: grouped = df.groupby(\'A\')\nIn [75]: for name, group in grouped:\n   ....:     print(name)\n   ....:     print(group)\n   ....: \nbar\n     A      B         C         D\n1  bar    one  0.254161  1.511763\n3  bar  three  0.215897 -0.990582\n5  bar    two -0.077118  1.211526\nfoo\n     A      B         C         D\n0  foo    one -0.575247  1.346061\n2  foo    two -1.143704  1.627081\n4  foo    two  1.193555 -0.441652\n6  foo    one -0.408530  0.268520\n7  foo  three -0.862495  0.024580\nIn the case of grouping by multiple keys, the group name will be a tuple:')], 'web_fallback': True}}

---

---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---
---GRADE GENERATION vs QUESTION---
---DECISION: GENERATION ADDRESSES QUESTION---
{'generate': {'candidate_answer': "To calculate the sum by groups in pandas, you can use the `groupby` method followed by the `sum` method. For example, `grouped = df.groupby('column_name')` and then `grouped.sum()`. This will give you the sum of each group based on the specified column.", 'retries': 0}}

---

---FINALIZING THE RESPONSE---
{'finalize_response': {'messages': [AIMessage(content="To calculate the sum by groups in pandas, you can use the `groupby` method followed by the `sum` method. For example, `grouped = df.groupby('column_name')` and then `grouped.sum()`. This will give you the sum of each group based on the specified column.")]}}

---

20240829

  • 新园的椒麻鱼片和麻辣牛肉性价比还行,但不能天天吃,重口,不过别的真没好吃的了,中午点了一份油豆腐红烧肉,结果里面只有油豆腐以及两三块肥肉,油面筋的肉也是肉眼可见的不行,而且绿叶菜也没有。月底这几天冲量,又开始饿得不行,不多吃点是真不行。新食堂,没有你我可怎么活,新食堂!(二周目)

  • 晚上九点下去慢跑,发现师傅居然仁慈地开了大门。起手渐加速5000米@3’52",心率167bpm,确实不吃力,而且没换鞋,9圈时XR来跟了我最后3圈半,小家伙这个月跑量都不到60K,属实是拉胯得不行。补2000米慢跑@4’19"收尾,拉了两个引体,很无力,身体好沉,不太能拉的上去,最近力量做得太少,还是不行。

Query with a fallback

VERBOSE = True
inputs = {"messages": [("human", "how do i convert a column into dummies")]}
for output in graph.stream(inputs, {"configurable": {"max_retries": 1}}):
    print(output)
    print("\n---\n")

输出:

---RETRIEVE---
{'document_search': {'question': 'how do i convert a column into dummies', 'documents': [Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/indexing.html', 'title': 'Indexing and selecting data — pandas 2.2.2 documentation'}, page_content="In [7]: df[['B', 'A']] = df[['A', 'B']]\nIn [8]: df\nOut[8]: \n                   A         B         C         D\n2000-01-01 -0.282863  0.469112 -1.509059 -1.135632\n2000-01-02 -0.173215  1.212112  0.119209 -1.044236\n2000-01-03 -2.104569 -0.861849 -0.494929  1.071804\n2000-01-04 -0.706771  0.721555 -1.039575  0.271860\n2000-01-05  0.567020 -0.424972  0.276232 -1.087401\n2000-01-06  0.113648 -0.673690 -1.478427  0.524988\n2000-01-07  0.577046  0.404705 -1.715002 -1.039268\n2000-01-08 -1.157892 -0.370647 -1.344312  0.844885\nYou may find this useful for applying a transform (in-place) to a subset of the\ncolumns.\nWarning\npandas aligns all AXES when setting Series and DataFrame from .loc.\nThis will not modify df because the column alignment is before value assignment."), Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/indexing.html', 'title': 'Indexing and selecting data — pandas 2.2.2 documentation'}, page_content="In [21]: sa.a = 5\nIn [22]: sa\nOut[22]: \na    5\nb    2\nc    3\ndtype: int64\nIn [23]: dfa.A = list(range(len(dfa.index)))  # ok if A already exists\nIn [24]: dfa\nOut[24]: \n            A         B         C         D\n2000-01-01  0  0.469112 -1.509059 -1.135632\n2000-01-02  1  1.212112  0.119209 -1.044236\n2000-01-03  2 -0.861849 -0.494929  1.071804\n2000-01-04  3  0.721555 -1.039575  0.271860\n2000-01-05  4 -0.424972  0.276232 -1.087401\n2000-01-06  5 -0.673690 -1.478427  0.524988\n2000-01-07  6  0.404705 -1.715002 -1.039268\n2000-01-08  7 -0.370647 -1.344312  0.844885\nIn [25]: dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column"), Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/merging.html', 'title': 'Merge, join, concatenate and compare — pandas 2.2.2 documentation'}, page_content='In [144]: df = pd.DataFrame(\n   .....:     {\n   .....:         "col1": ["a", "a", "b", "b", "a"],\n   .....:         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],\n   .....:         "col3": [1.0, 2.0, 3.0, 4.0, 5.0],\n   .....:     },\n   .....:     columns=["col1", "col2", "col3"],\n   .....: )\n   .....: \nIn [145]: df\nOut[145]: \n  col1  col2  col3\n0    a   1.0   1.0\n1    a   2.0   2.0\n2    b   3.0   3.0\n3    b   NaN   4.0\n4    a   5.0   5.0\nIn [146]: df2 = df.copy()\nIn [147]: df2.loc[0, "col1"] = "c"\nIn [148]: df2.loc[2, "col3"] = 4.0\nIn [149]: df2\nOut[149]: \n  col1  col2  col3\n0    c   1.0   1.0\n1    a   2.0   2.0\n2    b   3.0   4.0\n3    b   NaN   4.0\n4    a   5.0   5.0'), Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/groupby.html', 'title': 'Group by: split-apply-combine — pandas 2.2.2 documentation'}, page_content='In [225]: df\nOut[225]: \n  Branch Buyer  Quantity                Date\n0      A  Carl         1 2013-01-01 13:00:00\n1      A  Mark         3 2013-01-01 13:05:00\n2      A  Carl         5 2013-10-01 20:00:00\n3      A  Carl         1 2013-10-02 10:00:00\n4      A   Joe         8 2013-10-01 20:00:00\n5      A   Joe         1 2013-10-02 10:00:00\n6      A   Joe         9 2013-12-02 12:00:00\n7      B  Carl         3 2013-12-02 14:00:00\nGroupby a specific column with the desired frequency. This is like resampling.')], 'web_fallback': True}}

---

---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---
{'generate': {'candidate_answer': "You can convert a column into dummy variables using the `pd.get_dummies()` function in pandas. For example, `pd.get_dummies(df['col1'])` will convert the 'col1' column into dummy variables. This function creates a new DataFrame with binary columns for each unique value in the original column.", 'retries': 0}}

---

---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---
{'generate': {'candidate_answer': "You can convert a column into dummies using the `pd.get_dummies()` function in pandas. For example, if you have a DataFrame `df` and you want to convert the column `col1` into dummies, you can use `pd.get_dummies(df, columns=['col1'])`. This will create a new DataFrame with dummy variables for each unique value in `col1`.", 'retries': 1}}

---

---RUNNING WEB SEARCH---
{'web_search': {'documents': [Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/indexing.html', 'title': 'Indexing and selecting data — pandas 2.2.2 documentation'}, page_content="In [7]: df[['B', 'A']] = df[['A', 'B']]\nIn [8]: df\nOut[8]: \n                   A         B         C         D\n2000-01-01 -0.282863  0.469112 -1.509059 -1.135632\n2000-01-02 -0.173215  1.212112  0.119209 -1.044236\n2000-01-03 -2.104569 -0.861849 -0.494929  1.071804\n2000-01-04 -0.706771  0.721555 -1.039575  0.271860\n2000-01-05  0.567020 -0.424972  0.276232 -1.087401\n2000-01-06  0.113648 -0.673690 -1.478427  0.524988\n2000-01-07  0.577046  0.404705 -1.715002 -1.039268\n2000-01-08 -1.157892 -0.370647 -1.344312  0.844885\nYou may find this useful for applying a transform (in-place) to a subset of the\ncolumns.\nWarning\npandas aligns all AXES when setting Series and DataFrame from .loc.\nThis will not modify df because the column alignment is before value assignment."), Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/indexing.html', 'title': 'Indexing and selecting data — pandas 2.2.2 documentation'}, page_content="In [21]: sa.a = 5\nIn [22]: sa\nOut[22]: \na    5\nb    2\nc    3\ndtype: int64\nIn [23]: dfa.A = list(range(len(dfa.index)))  # ok if A already exists\nIn [24]: dfa\nOut[24]: \n            A         B         C         D\n2000-01-01  0  0.469112 -1.509059 -1.135632\n2000-01-02  1  1.212112  0.119209 -1.044236\n2000-01-03  2 -0.861849 -0.494929  1.071804\n2000-01-04  3  0.721555 -1.039575  0.271860\n2000-01-05  4 -0.424972  0.276232 -1.087401\n2000-01-06  5 -0.673690 -1.478427  0.524988\n2000-01-07  6  0.404705 -1.715002 -1.039268\n2000-01-08  7 -0.370647 -1.344312  0.844885\nIn [25]: dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column"), Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/merging.html', 'title': 'Merge, join, concatenate and compare — pandas 2.2.2 documentation'}, page_content='In [144]: df = pd.DataFrame(\n   .....:     {\n   .....:         "col1": ["a", "a", "b", "b", "a"],\n   .....:         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],\n   .....:         "col3": [1.0, 2.0, 3.0, 4.0, 5.0],\n   .....:     },\n   .....:     columns=["col1", "col2", "col3"],\n   .....: )\n   .....: \nIn [145]: df\nOut[145]: \n  col1  col2  col3\n0    a   1.0   1.0\n1    a   2.0   2.0\n2    b   3.0   3.0\n3    b   NaN   4.0\n4    a   5.0   5.0\nIn [146]: df2 = df.copy()\nIn [147]: df2.loc[0, "col1"] = "c"\nIn [148]: df2.loc[2, "col3"] = 4.0\nIn [149]: df2\nOut[149]: \n  col1  col2  col3\n0    c   1.0   1.0\n1    a   2.0   2.0\n2    b   3.0   4.0\n3    b   NaN   4.0\n4    a   5.0   5.0'), Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/groupby.html', 'title': 'Group by: split-apply-combine — pandas 2.2.2 documentation'}, page_content='In [225]: df\nOut[225]: \n  Branch Buyer  Quantity                Date\n0      A  Carl         1 2013-01-01 13:00:00\n1      A  Mark         3 2013-01-01 13:05:00\n2      A  Carl         5 2013-10-01 20:00:00\n3      A  Carl         1 2013-10-02 10:00:00\n4      A   Joe         8 2013-10-01 20:00:00\n5      A   Joe         1 2013-10-02 10:00:00\n6      A   Joe         9 2013-12-02 12:00:00\n7      B  Carl         3 2013-12-02 14:00:00\nGroupby a specific column with the desired frequency. 
This is like resampling.'), Document(metadata={'source': 'websearch'}, page_content="And to add a prefix to the columns use: dummies.columns = ['D_'+col_name for col_name in dummies.columns] - Ufos. Commented Nov 12, 2017 at 23:06. 2 ... You can use str.join to join all elements in list present in series into string and then use str.get_dummies: ... Pandas convert dummies to a new column. Hot Network Questions\nColumns in the output are each named after a value; if the input is\na DataFrame, the name of the original variable is prepended to the value.\n Examples\nprevious\npandas.concat\nnext\npandas.from_dummies\n© 2024, pandas via NumFOCUS, Inc. Site Navigation\nSite Navigation\npandas.get_dummies#\nConvert categorical variable into dummy/indicator variables.\n If data contains other columns than the\ndummy-coded one(s), these will be prepended, unaltered, to the result.\n Whether the dummy-encoded columns should be backed by\na SparseArray (True) or a regular NumPy array (False).\n\nThis is a non-exhaustive solution to specifying many different columns to get_dummies while excluding some columns. Using the built-in filter() function on df.columns is also an option. pd.get_dummies only works on columns with an object dtype when columns=None . Another potential option is to set only columns to be transformed with the object ...")], 'web_fallback': False}}

---

---GENERATE---
{'generate': {'candidate_answer': "You can convert a column into dummies using the `pd.get_dummies` function in pandas. For example, `pd.get_dummies(df['column_name'])` will create dummy variables for the specified column. If you want to include the dummies in the original DataFrame, you can use `df = pd.get_dummies(df, columns=['column_name'])`.", 'retries': 2}}

---

---FINALIZING THE RESPONSE---
{'finalize_response': {'messages': [AIMessage(content="You can convert a column into dummies using the `pd.get_dummies` function in pandas. For example, `pd.get_dummies(df['column_name'])` will create dummy variables for the specified column. If you want to include the dummies in the original DataFrame, you can use `df = pd.get_dummies(df, columns=['column_name'])`.")]}}

---

也可以直接 invoke,查看最终返回的完整状态:

graph.invoke(inputs)
---RETRIEVE---
---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---
---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---
---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---
---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---
---RUNNING WEB SEARCH---
---GENERATE---
---FINALIZING THE RESPONSE---
{'messages': [HumanMessage(content='how do i convert a column into dummies', id='69125d7d-3b0e-42d7-8192-d43fd71ef25f'),
  AIMessage(content="You can convert a column into dummies using the `pd.get_dummies` function in pandas. For example, `pd.get_dummies(df['column_name'])` will create dummy variables for the specified column. If you want to include the original DataFrame, you can use `pd.get_dummies(df, columns=['column_name'])`.", id='c199ec88-cc3c-494d-adeb-3c7290bfe4b6')],
 'question': 'how do i convert a column into dummies',
 'documents': [Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/indexing.html', 'title': 'Indexing and selecting data — pandas 2.2.2 documentation'}, page_content="In [7]: df[['B', 'A']] = df[['A', 'B']]\nIn [8]: df\nOut[8]: \n                   A         B         C         D\n2000-01-01 -0.282863  0.469112 -1.509059 -1.135632\n2000-01-02 -0.173215  1.212112  0.119209 -1.044236\n2000-01-03 -2.104569 -0.861849 -0.494929  1.071804\n2000-01-04 -0.706771  0.721555 -1.039575  0.271860\n2000-01-05  0.567020 -0.424972  0.276232 -1.087401\n2000-01-06  0.113648 -0.673690 -1.478427  0.524988\n2000-01-07  0.577046  0.404705 -1.715002 -1.039268\n2000-01-08 -1.157892 -0.370647 -1.344312  0.844885\nYou may find this useful for applying a transform (in-place) to a subset of the\ncolumns.\nWarning\npandas aligns all AXES when setting Series and DataFrame from .loc.\nThis will not modify df because the column alignment is before value assignment."),
  Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/indexing.html', 'title': 'Indexing and selecting data — pandas 2.2.2 documentation'}, page_content="In [21]: sa.a = 5\nIn [22]: sa\nOut[22]: \na    5\nb    2\nc    3\ndtype: int64\nIn [23]: dfa.A = list(range(len(dfa.index)))  # ok if A already exists\nIn [24]: dfa\nOut[24]: \n            A         B         C         D\n2000-01-01  0  0.469112 -1.509059 -1.135632\n2000-01-02  1  1.212112  0.119209 -1.044236\n2000-01-03  2 -0.861849 -0.494929  1.071804\n2000-01-04  3  0.721555 -1.039575  0.271860\n2000-01-05  4 -0.424972  0.276232 -1.087401\n2000-01-06  5 -0.673690 -1.478427  0.524988\n2000-01-07  6  0.404705 -1.715002 -1.039268\n2000-01-08  7 -0.370647 -1.344312  0.844885\nIn [25]: dfa['A'] = list(range(len(dfa.index)))  # use this form to create a new column"),
  Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/merging.html', 'title': 'Merge, join, concatenate and compare — pandas 2.2.2 documentation'}, page_content='In [144]: df = pd.DataFrame(\n   .....:     {\n   .....:         "col1": ["a", "a", "b", "b", "a"],\n   .....:         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],\n   .....:         "col3": [1.0, 2.0, 3.0, 4.0, 5.0],\n   .....:     },\n   .....:     columns=["col1", "col2", "col3"],\n   .....: )\n   .....: \nIn [145]: df\nOut[145]: \n  col1  col2  col3\n0    a   1.0   1.0\n1    a   2.0   2.0\n2    b   3.0   3.0\n3    b   NaN   4.0\n4    a   5.0   5.0\nIn [146]: df2 = df.copy()\nIn [147]: df2.loc[0, "col1"] = "c"\nIn [148]: df2.loc[2, "col3"] = 4.0\nIn [149]: df2\nOut[149]: \n  col1  col2  col3\n0    c   1.0   1.0\n1    a   2.0   2.0\n2    b   3.0   4.0\n3    b   NaN   4.0\n4    a   5.0   5.0'),
  Document(metadata={'language': 'en', 'source': 'https://pandas.pydata/docs/user_guide/groupby.html', 'title': 'Group by: split-apply-combine — pandas 2.2.2 documentation'}, page_content='In [225]: df\nOut[225]: \n  Branch Buyer  Quantity                Date\n0      A  Carl         1 2013-01-01 13:00:00\n1      A  Mark         3 2013-01-01 13:05:00\n2      A  Carl         5 2013-10-01 20:00:00\n3      A  Carl         1 2013-10-02 10:00:00\n4      A   Joe         8 2013-10-01 20:00:00\n5      A   Joe         1 2013-10-02 10:00:00\n6      A   Joe         9 2013-12-02 12:00:00\n7      B  Carl         3 2013-12-02 14:00:00\nGroupby a specific column with the desired frequency. This is like resampling.'),
  Document(metadata={'source': 'websearch'}, page_content="And to add a prefix to the columns use: dummies.columns = ['D_'+col_name for col_name in dummies.columns] - Ufos. Commented Nov 12, 2017 at 23:06. 2 ... You can use str.join to join all elements in list present in series into string and then use str.get_dummies: ... Pandas convert dummies to a new column. Hot Network Questions\nColumns in the output are each named after a value; if the input is\na DataFrame, the name of the original variable is prepended to the value.\n Examples\nprevious\npandas.concat\nnext\npandas.from_dummies\n© 2024, pandas via NumFOCUS, Inc. Site Navigation\nSite Navigation\npandas.get_dummies#\nConvert categorical variable into dummy/indicator variables.\n If data contains other columns than the\ndummy-coded one(s), these will be prepended, unaltered, to the result.\n Whether the dummy-encoded columns should be backed by\na SparseArray (True) or a regular NumPy array (False).\n\nThis is a non-exhaustive solution to specifying many different columns to get_dummies while excluding some columns. Using the built-in filter() function on df.columns is also an option. pd.get_dummies only works on columns with an object dtype when columns=None . Another potential option is to set only columns to be transformed with the object ...")],
 'candidate_answer': "You can convert a column into dummies using the `pd.get_dummies` function in pandas. For example, `pd.get_dummies(df['column_name'])` will create dummy variables for the specified column. If you want to include the original DataFrame, you can use `pd.get_dummies(df, columns=['column_name'])`.",
 'retries': 4,
 'web_fallback': False}

20240830

  • 新园金汤鱼片唯一真神!对比椒麻和麻辣的优势就是不重口,而且汤可以当鱼汤喝,确实不错,可以算是蜀地源的平替。蜀地源的问题就是每次吃完回来撑得要死,而且辣得胸口烧得难受,根本跑不了,现在对比唯一的优势就是可以吃到笋片和菌类,蔬菜的选择多一些(天气炎热,市面上绿叶菜都贵得不行,食堂也完全没有供应,新园快餐,黄瓜炒蛋都能卖5元1份,辣子鸡10元1份,价格翻了一倍,就离谱),性价比被新园这个完爆了。

  • 晚上九点半慢跑30分钟@6.85km,平均心率157bpm,月跑量凑满200K整,平均配速4’23",跑完非常舒服,完全没有疲劳感。时隔两个多月,手表终于重新给出高效训练的表现评价,这回我是真行了。

  • PS:静香姐今晚去杨体参加AR传承接力赛,看号码牌居然是70后(还是低估年龄了,之前叫她阿姨还不高兴),赛后群里发表名言,想娶我女儿必须跑得比我快(全马314,ITRA表现分619,可怕)。

关于rope-beta-encoding:

$$\left\lfloor \frac{n}{\beta^{m-1}} \right\rfloor \bmod \beta$$

  • n 表示原始的数值
  • beta 表示进制
  • m 表示位置,由右向左
def beta_encoding(n, beta):
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        remainder = int(n % beta)
        if remainder >= 10:
            # 将10到15的数字转换为'A'到'F'
            digits.append(chr(ord('A') + remainder - 10))
        else:
            digits.append(str(remainder))
        n = n // beta
    # 将结果反转并转换为字符串表示
    beta_base_digits = ''.join(digits[::-1])
    return beta_base_digits

这个是原始的beta-encoding,rope在此基础上做了一些调整:

$$\Theta = \{\theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \ldots, d/2]\}$$

位置n的旋转位置编码(RoPE),本质上就是数字n的β进制编码!

$$\left[ \cos\left(\frac{n}{\beta^0}\right), \sin\left(\frac{n}{\beta^0}\right), \cos\left(\frac{n}{\beta^1}\right), \sin\left(\frac{n}{\beta^1}\right), \ldots, \cos\left(\frac{n}{\beta^{d/2-1}}\right), \sin\left(\frac{n}{\beta^{d/2-1}}\right) \right]$$

  • $\beta=10000^{2/d}$
  • 每一位都存在一个 $\frac{n}{\beta^{m-1}}$
  • $\sin,\cos$ 同 mod 运算一样,可以保证周期性
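
把这个对应关系写成代码:位置 n 的 RoPE 就是在第 m 位上取 n/β^{m-1} 的 cos 与 sin(示意,d 取 8,可与前面的 beta_encoding 对照):

import numpy as np

def rope_encoding(n, d=8, base=10000):
    beta = base ** (2 / d)                 # β = 10000^{2/d}
    m = np.arange(d // 2)                  # 第 m+1 位(这里 m 从 0 开始计)
    angles = n / beta ** m                 # 对应 n / β^{m-1}
    return np.stack([np.cos(angles), np.sin(angles)], axis=1).reshape(-1)

print(rope_encoding(5))        # 位置 5 的 8 维 RoPE 向量
print(beta_encoding(5, 2))     # 对照:数字 5 的二进制表示 '101'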

RoPE with PI (Position Interpolation)

  • 内插方案就是将 n 换成 n/k,其中 k 是要扩大的倍数
    • $k=\frac{L'}{L}$
  • $\theta_i=\text{base}^{-2i/d}=10000^{-2i/d}$
import matplotlib.pyplot as plt
import numpy as np

# Define the values for L and theta
L = 12
theta = 2 * np.pi / L

# Calculate original token positions
original_positions = np.arange(1, L + 1)
cos_m_theta_original = np.cos(original_positions * theta)
sin_m_theta_original = np.sin(original_positions * theta)

# Calculate interpolated token positions
interpolated_positions = np.arange(1, 2 * L + 1)
cos_m_theta_interpolated = np.cos(interpolated_positions * theta / 2)
sin_m_theta_interpolated = np.sin(interpolated_positions * theta / 2)

# Plot the figure
plt.figure(figsize=(6, 6))
plt.plot(cos_m_theta_original, sin_m_theta_original, 'bo-', label='Original tokens at positions $m=1,...,L$')
plt.plot(cos_m_theta_interpolated, sin_m_theta_interpolated, 'ro-', label='Tokens after Position Interpolation at positions $m=1,...,2L$')
plt.xlabel(r'$\cos(m \theta)$')
plt.ylabel(r'$\sin(m \theta)$')
plt.legend()
plt.grid(True)
plt.title('Position Encoding Visualization')


20240831

  • 猛吃,排骨汤+三文鱼+牛肉白菜大馄饨,中午晚上连撑两顿,下午完全没动,晚上还能吃得下,我是真的饿了。

  • 晚上回来操场开放,趁机把力量做一下,已经两周没有做力量了,30箭步×8组(+20kg),组间50个双脚提踵(+20kg),稳定度比之前稍差一些,不过问题不大,结束补慢跑2000米@4’39",这样六七八三个月的跑量分别为200.2K,216.6K,202.1K,均配分别为4’16",4’14",4’23",可以算是保质又保量的三个月夏训恢复,接下来就是我的show time了。

  • PS:回实验室亦童跟我抱怨wyl又给他派烂活,居然给别人做论文代码补全的工作,就离谱,什么年代了,还在用keras写VAE来伪造数据,我跟亦童说,这就是能力越大,能力也就越大,反正我现在是根本不想理wyl的烂活。

Langchain基础(六)多轮对话 qlora SFT(Multi-turn Conversation)

1 关于SFT

Recall that creating a ChatGPT at home involves 3 steps:

  1. pre-training a large language model (LLM) to predict the next token on internet-scale data, on clusters of thousands of GPUs. One calls the result a “base model”
  2. supervised fine-tuning (SFT) to turn the base model into a useful assistant
    • base model => “chatbot”/“assistant”/“instruct”
    • fine-tuning the model on human instruction data, using the cross-entropy loss.
    • This means that the model is still trained to predict the next token, although we now want the model to generate useful completions given an instruction like “what are 10 things to do in London?”, “How can I make pancakes?” or “Write me a poem about elephants”.
    • https://gizmodo.com/chatgpt-openai-ai-contractors-15-dollars-per-hour-1850415474
      • 工人通常每小时赚15美元的标注合同工
  3. human preference fine-tuning which increases the assistant’s friendliness, helpfulness and safety.

SFT

  • RAG SFT
    • https://www.bilibili.com/video/BV1Yx4y147t4/
  • Multi-Turn conversation SFT
  • Tool use (function calling) SFT

2 数据集

  • Zephyr: distilled SFT (dSFT),distilled DPO(dDPO)
    • https://arxiv.org/pdf/2310.16944
    • https://github.com/huggingface/alignment-handbook
from datasets import load_dataset
# based on config
raw_datasets = load_dataset("HuggingFaceH4/ultrachat_200k")

数据集形如:

DatasetDict({
    train_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 207865
    })
    test_sft: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 23110
    })
    train_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 256032
    })
    test_gen: Dataset({
        features: ['prompt', 'prompt_id', 'messages'],
        num_rows: 28304
    })
})
from datasets import DatasetDict
raw_datasets = DatasetDict({
    "train": raw_datasets["train_sft"],
    "test": raw_datasets["test_sft"]
})

# from datasets import load_dataset
# raw_datasets = load_dataset("HuggingFaceH4/ultrachat_200k", split=["train_sft", "test_sft"])
# raw_datasets = DatasetDict({
#     "train": raw_datasets["train_sft"],
#     "test": raw_datasets["test_sft"]
# })

# from datasets import DatasetDict

# # remove this when done debugging
# indices = range(0,100)
# dataset_dict = {"train": raw_datasets["train_sft"].select(indices),
#                 "test": raw_datasets["test_sft"].select(indices)}
# raw_datasets = DatasetDict(dataset_dict)
# raw_datasets

raw_datasets['train'][0].keys() # dict_keys(['prompt', 'prompt_id', 'messages'])
print(raw_datasets['train'][0]['prompt_id'], raw_datasets['train'][0]['prompt'])
"""
f0e37e9f7800261167ce91143f98f511f768847236f133f2d0aed60b444ebe57 These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?
"""

然后我们看一下数据集的对话内容:

for msg in raw_datasets['train'][0]['messages']:
    role = msg['role']
    content = msg['content']
    print(f'{role:20}:  {content}')

输出:

user                :  These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?
assistant           :  This feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.
user                :  Can you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?
assistant           :  Sure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:

1. Log in to your Shopify account and go to your Online Store.
2. Click on Customize theme for the section-based theme you are using.
3. Navigate to the Collection pages or Featured Collections section where you want to enable the secondary image hover feature.
4. With the section open, click on the section's setting (gear) icon in the top-left corner.
5. In the settings panel that appears, look for an option labeled 'Image display' or 'Image hover'.
6. If available, select 'Show secondary image on hover'.
7. Save the changes and preview the Collection/Featured Collection page to see the effect.

If you're having trouble finding the setting, the best thing to do is refer to your theme's documentation, since the location and labeling of settings can vary between themes.
user                :  Can you provide me with a link to the documentation for my theme?
assistant           :  I don't have access to your store's theme information. However, you can usually find the documentation for your theme by going to the shopify theme store, finding your theme and clicking on the 'support' link located in the bottom right corner of the page. Alternatively, you can do a google search for the name of your theme followed by 'documentation' or 'user guide'.
user                :  Can you confirm if this feature also works for the Quick Shop section of my theme?
assistant           :  The secondary image hover feature may or may not work for your Quick Shop section, depending on the configuration of your theme. Some themes include this feature in the Quick Shop section by default, while others may require additional customization. To check if this feature is available for the Quick Shop section of your theme, follow these steps:

1. Go to the Quick Shop section where you would like to enable the feature. 2. Click on the Quick Shop settings icon (gear icon) and look for 'Image display' or 'Image hover'. 3. If available, select 'Show secondary image on hover'. 4. Save the changes. If this option is not available in your Quick Shop section settings, you may need to reach out to your theme developer for assistance with customizing your Quick Shop section to include this feature.

20240901

  • 一转九月,偶遇失踪三个多月的宋某,风格大变,现在该走艺术流了,我还指望他能高百前回归,这样又损失一位核心队员,真的很难。

  • AK早上去白玉兰楼进货,痛失前八,不过在路演摊拼了个安慰奖,HUAWEI Freebuds一副,含泪赚回报名费,xs,人老不中用咯,想当年AK在这可是拿过冠军,去年还位列前三。

  • 九点半下去慢跑了会儿,看到安迪和YZZ在环校。30分钟@6.56K,平均心率153bpm,很满意的表现,稳定,轻快,点到为止,并不想跑得更多。这是最好的状态,可遇不可求,如果现在是十一月,我有把握完成5000米到半马的PB,甚至可以冲一冲全马250,可惜留给我的时间真的已经很少了,很难把目前的状态保持到赛前。

  • PS:塞翁失马,焉知非福?总是想快一次,可能就要慢一辈子。买苹果,出门就摔了一地,真点背,emmm。

3 tokenizer

  • pad token
    • During pre-training, one doesn’t need to pad since one just creates blocks of text to predict the next token, but during fine-tuning, we will need to pad the (instruction, completion) pairs in order to create batches of equal length.
  • max seqlen
    • this is required in order to truncate sequences which are too long for the model. Here we decide to train on at most 2048 tokens.
    • 会显著地影响显存的占用
  • chat template:https://huggingface.co/blog/chat-templates
    • <|user|> to indicate a user message and <|assistant|> to indicate the chatbot’s response
    • 在 hf Transformers,chat_template 定义在 tokenizer
    • base model 是 None,instruct model 对应的 tokenizer 会有定义;
from transformers import AutoTokenizer
model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token, tokenizer.pad_token_id # (None, None)

print(tokenizer.bos_token, tokenizer.eos_token) # <s> </s>
print(tokenizer.encode('<s>', add_special_tokens=False), tokenizer.encode('</s>', add_special_tokens=False)) # [1] [2]
print(tokenizer.decode(1), tokenizer.decode(2)) # <s> </s>
# set pad_token_id equal to the eos_token_id if not set
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.model_max_length # 1000000000000000019884624838656

# Set reasonable default for models without max length
# 会显著地影响显存的占用
if tokenizer.model_max_length > 100_000:
    tokenizer.model_max_length = 2048
# base model 
tokenizer.chat_template

print('meta-llama/Meta-Llama-3-8B')
print(AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B').chat_template)
print('======================')
print('meta-llama/Meta-Llama-3-8B-Instruct')
print(AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B-Instruct').chat_template)
print('======================')
print('mistralai/Mistral-7B-Instruct-v0.1')
print(AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1').chat_template)

输出的template结果:

meta-llama/Meta-Llama-3-8B
None
======================
meta-llama/Meta-Llama-3-8B-Instruct
{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
======================
mistralai/Mistral-7B-Instruct-v0.1
{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token + ' ' }}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}

4 apply chat template

DEFAULT_CHAT_TEMPLATE = "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}"
tokenizer.chat_template = DEFAULT_CHAT_TEMPLATE
  • tokenizer.apply_chat_template(messages, tokenize=False)
    • 接受的是 list
    • 基于 role
      • '<|system|>\n' + message['content'] + eos_token
      • '<|user|>\n' + message['content'] + eos_token
      • '<|assistant|>\n' + message['content'] + eos_token
tokenizer.eos_token # '</s>'
import re
import random
from multiprocessing import cpu_count

def apply_chat_template(example, tokenizer):
    messages = example["messages"]
    # We add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example
column_names = list(raw_datasets["train"].features)
column_names # ['prompt', 'prompt_id', 'messages']

raw_datasets = raw_datasets.map(apply_chat_template,
                                num_proc=cpu_count(),
                                fn_kwargs={"tokenizer": tokenizer},
                                remove_columns=column_names,
                                desc="Applying chat template",)
"""
Applying chat template (num_proc=64):   0%|          | 0/207865 [00:00<?, ? examples/s]
Applying chat template (num_proc=64):   0%|          | 0/23110 [00:00<?, ? examples/s]
"""
# create the splits
train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]
train_dataset
"""
Dataset({
    features: ['text'],
    num_rows: 207865
})
"""

看一下raw_datasets['train'][0]['text']

<|system|>
</s>
<|user|>
These instructions apply to section-based themes (Responsive 6.0+, Retina 4.0+, Parallax 3.0+ Turbo 2.0+, Mobilia 5.0+). What theme version am I using?
On your Collections pages & Featured Collections sections, you can easily show the secondary image of a product on hover by enabling one of the theme's built-in settings!
Your Collection pages & Featured Collections sections will now display the secondary product image just by hovering over that product image thumbnail.
Does this feature apply to all sections of the theme or just specific ones as listed in the text material?</s>
<|assistant|>
This feature only applies to Collection pages and Featured Collections sections of the section-based themes listed in the text material.</s>
<|user|>
Can you guide me through the process of enabling the secondary image hover feature on my Collection pages and Featured Collections sections?</s>
<|assistant|>
Sure, here are the steps to enable the secondary image hover feature on your Collection pages and Featured Collections sections:

1. Log in to your Shopify account and go to your Online Store.
2. Click on Customize theme for the section-based theme you are using.
3. Navigate to the Collection pages or Featured Collections section where you want to enable the secondary image hover feature.
4. With the section open, click on the section's setting (gear) icon in the top-left corner.
5. In the settings panel that appears, look for an option labeled 'Image display' or 'Image hover'.
6. If available, select 'Show secondary image on hover'.
7. Save the changes and preview the Collection/Featured Collection page to see the effect.

If you're having trouble finding the setting, the best thing to do is refer to your theme's documentation, since the location and labeling of settings can vary between themes.</s>
<|user|>
Can you provide me with a link to the documentation for my theme?</s>
<|assistant|>
I don't have access to your store's theme information. However, you can usually find the documentation for your theme by going to the shopify theme store, finding your theme and clicking on the 'support' link located in the bottom right corner of the page. Alternatively, you can do a google search for the name of your theme followed by 'documentation' or 'user guide'.</s>
<|user|>
Can you confirm if this feature also works for the Quick Shop section of my theme?</s>
<|assistant|>
The secondary image hover feature may or may not work for your Quick Shop section, depending on the configuration of your theme. Some themes include this feature in the Quick Shop section by default, while others may require additional customization. To check if this feature is available for the Quick Shop section of your theme, follow these steps:

1. Go to the Quick Shop section where you would like to enable the feature. 2. Click on the Quick Shop settings icon (gear icon) and look for 'Image display' or 'Image hover'. 3. If available, select 'Show secondary image on hover'. 4. Save the changes. If this option is not available in your Quick Shop section settings, you may need to reach out to your theme developer for assistance with customizing your Quick Shop section to include this feature.</s>

20240902

  • 新生即将报到,食堂快餐终于拟人了些,牛肉和绿叶菜都有了(但是,新食堂还没修好,是真能拖

  • 晚上九点,感觉有点困,但还是起手渐加速跑了一个19’46"的5000米(4’12"+4’04"+3’58"+3’54"+3’36"),心率163bpm,补了两段慢跑冷身,凑了半小时,有个没见过的小哥跟了最后600米冲刺,可能是新生,看起来水平尚可,有待观察。

  • PS:最近发现几个有意思的小BUG,有空更一下。

最近GLM免费开了一个GLM-4-Flash的调用接口,虽然模型暂时还不开源,但是有接口用也不错了。

一些Technique report在https://www.bilibili.com/read/cv37761855,展示的结果表明,在他们选取的任务上,GLM-4-Flash仅次于GLM-4-AirX,与Doubao-lite-32K同级别,远远超过其他模型,包括最新的Qwen2-7b-instruct,也要比Baichuan几个开源的7b 8b系列好,因此估测模型级别在70b左右。
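顺手记一个最小的调用示例(纯示意:假设用智谱官方的zhipuai SDK的v2风格接口,模型名写作glm-4-flash,API key从环境变量里读):

import os
from zhipuai import ZhipuAI

client = ZhipuAI(api_key=os.environ["ZHIPUAI_API_KEY"])
response = client.chat.completions.create(
    model="glm-4-flash",
    messages=[
        {"role": "system", "content": "你是一个乐于助人的助手。"},
        {"role": "user", "content": "用一句话解释一下RoPE位置编码。"},
    ],
)
print(response.choices[0].message.content)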

现在一个问题就是框架不互通,大家各玩各的:OpenAI、Zhipu、Llama、Paddle,诸如此类,用的都是不同的框架,学习成本太高。最早的时候还只有keras,后来torch和tensorflow,总还算是比较统一的架构,现在真的是牛鬼蛇神什么都来。所以我还是先把最近两天更的Langchain多轮对话最后一部分模型调用的东西记一下:

5 model

from transformers import BitsAndBytesConfig, TrainingArguments
import torch

# specify how to quantize the model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", 
    bnb_4bit_use_double_quant=True, 
    bnb_4bit_compute_dtype=torch.bfloat16,
)
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None

model_kwargs = dict(
    attn_implementation="flash_attention_2", # set this to True if your GPU supports it (Flash Attention drastically speeds up model computations)
    torch_dtype="auto",
    use_cache=False, # set to False as we're going to use gradient checkpointing
    device_map=device_map,
    quantization_config=quantization_config,
)
quantization_config
"""
BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
"""
device_map # {'': 0}
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)

模型架构:

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralFlashAttention2(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
model.model.layers[0].self_attn.q_proj.weight.shape, model.model.layers[0].self_attn.q_proj.weight
"""
(torch.Size([8388608, 1]),
 Parameter containing:
 Parameter(Params4bit([[120],
             [119],
             [135],
             ...,
             [119],
             [119],
             [119]], device='cuda:0', dtype=torch.uint8)))
"""

注意torch没有int4类型的数据,它是用1个uint8来表示两个int4的数据,如下:

4096*4096/2 # 8388608.0
model.model.layers[0].mlp.gate_proj.weight.dtype # torch.uint8
model.lm_head.weight.dtype # torch.bfloat16
model.model.embed_tokens.weight.dtype # torch.bfloat16

另外默认都是fp32,需要手动搞成bf16节约显存
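顺带一个最小的示意:加载时直接指定torch_dtype,而不是让它默认落在fp32上(model_id沿用上文):

import torch
from transformers import AutoModelForCausalLM

# 不量化、直接以 bf16 权重加载,显存占用大约是 fp32 的一半
model_bf16 = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
print(model_bf16.lm_head.weight.dtype)  # torch.bfloat16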

6 trl sft trainer

训练pipeline:

import os
os.environ['NCCL_P2P_DISABLE'] = "1"
os.environ['NCCL_IB_DISABLE'] = '1'

from trl import SFTTrainer, SFTConfig
from peft import LoraConfig
from transformers import TrainingArguments
# path where the Trainer will save its checkpoints and logs
output_dir = 'data/mistral-7b-sft-lora'

# based on config
training_args = SFTConfig(
    fp16=True, # specify bf16=True instead when training on GPUs that support bf16
    do_eval=True,
    eval_strategy="epoch",
    per_device_eval_batch_size=4, # originally set to 8
    per_device_train_batch_size=4, # originally set to 8
    gradient_accumulation_steps=64,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    learning_rate=2.0e-05,
    log_level="info",
    logging_steps=5,
    logging_strategy="steps",
    lr_scheduler_type="cosine",
    max_steps=-1,
    num_train_epochs=1,
    output_dir=output_dir,
    overwrite_output_dir=True,
    report_to="wandb",
    save_strategy="no",
    save_total_limit=None,
    seed=42,
    dataset_text_field="text",
    packing=True,
    dataset_num_proc=cpu_count(),
    max_seq_length=tokenizer.model_max_length,
)
# based on config
peft_config = LoraConfig(
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
trainer = SFTTrainer(
        # model=model_id,
        # model_init_kwargs=model_kwargs,
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        peft_config=peft_config,
    )
train_result = trainer.train() # 开训!

metrics = train_result.metrics
# SFTConfig 本身没有 max_train_samples 字段,直接访问会抛 AttributeError,这里用 getattr 兜底
max_train_samples = getattr(training_args, "max_train_samples", None) or len(train_dataset)
metrics["train_samples"] = min(max_train_samples, len(train_dataset))
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()

7 inference

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained(output_dir)
model = AutoModelForCausalLM.from_pretrained(output_dir, load_in_4bit=True, device_map="auto")

import torch

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

# prepare the messages for the model
input_ids = tokenizer.apply_chat_template(messages, truncation=True, add_generation_prompt=True, return_tensors="pt").to("cuda")

# inference
outputs = model.generate(
        input_ids=input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

20240903

  • 今天原本计划测万米,因为感觉这段时间状态很好,可以趁热打铁,又是雨后降温。但是没想到竟然这么难跑,低气压,闷,节奏极差。

  • 而且最近因为军训封操场的缘故,主要是路跑,自从8月24日20km之后,就没有穿过碳板,今天换160X3.0PRO,准备认真跑一下,发现居然不是很适应前掌跑法,前两公里一直在后跟跑,全掌跑和前掌跑之间纠结,怎么跑都感觉膈应,找不到舒适的节奏,然后第3个1000米又被踢球的人绊到,索性冲了一段,最后均配3’45"跑了3000米就不行了(347+351+338),极其差劲。

  • 结束补了三段节奏,4分上下的配,跑得特别难受,难受到难以置信。或许应该休一天了,反正九月中旬要出去一周,大概率也凑不齐200K了,缓一个月吧,似乎也没必要逼得太紧。

  • PS:目前体感有氧阈配应该在3’50"左右。每次想冲PB的时候都PB不了,反而都是只想随便跑跑的时候就能PB。还有就是,一个人确实很难冲PB。

杂记:

静态方法与类方法的比较

下面将比较两者之间的异同点:

  • 参数传递: 类方法的第一个参数通常是类本身(通常命名为cls),而静态方法没有这样的限制,它们不需要传递类或实例作为第一个参数。

  • 访问类属性: 类方法可以访问和修改类属性,因为它们可以通过第一个参数(类本身)来访问类的属性。而静态方法不能直接访问类属性,因为它们没有类对象作为参数。

其实我觉得@classmethod这个修饰符真没啥用,对象也可以调用,就是个标识符而已。
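一个小对照示例(随手写的Counter类,只为说明两者第一个参数的差别):

class Counter:
    total = 0  # 类属性

    @classmethod
    def incr(cls, n=1):
        # 第一个参数是类本身,可以读写类属性;被子类调用时 cls 就是子类
        cls.total += n
        return cls.total

    @staticmethod
    def add(a, b):
        # 没有 cls/self,拿不到类属性,本质就是挂在类名下的普通函数
        return a + b

print(Counter.incr())    # 1,通过类调用
print(Counter().incr())  # 2,通过实例调用也可以
print(Counter.add(3, 4)) # 7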


一个没用的小知识,关于eval

比如:

def f():
	question = "..."
	context = "..."
	answer = "..."
	columns = ["question", "context", "answer"]
	dic = {c: eval(c) for c in columns}
	print(dic)
f()

这样会报错找不到变量question,但是:

question = "..."
context = "..."
answer = "..."
columns = ["question", "context", "answer"]
dic = {c: eval(c) for c in columns}
print(dic)

就能跑通。原因是字典推导式在函数f内部会创建一个自己的作用域,eval(c)按名字查找时,在这个作用域里找不到f的局部变量question(它既不在推导式的局部变量里,也不是全局变量);而在模块顶层,question等本身就是全局变量,eval自然能找到。一个笨办法是把它们声明为全局变量:

# -*- coding: utf-8 -*- 
# @author : caoyang
# @email: caoyang@stu.sufe.edu

def f():
	global question, context, answer
	question = "..."
	context = "..."
	answer = "..."
	columns = ["question", "context", "answer"]
	dic = {c: eval(c) for c in columns}
	print(dic)
f()

但是这样也会出别的问题,局部值会覆盖全局的值,所以尽量别这么写
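如果只是想在函数里按名字取值,其实不必动global,先把locals()存下来再查表就可以(随手的sketch):

def f2():
	question = "..."
	context = "..."
	answer = "..."
	columns = ["question", "context", "answer"]
	local_vars = locals()  # 在推导式外面先拿到 f2 的局部变量表
	dic = {c: local_vars[c] for c in columns}  # 直接查表,不再依赖 eval 的隐式作用域
	# 等价写法:dic = {c: eval(c, globals(), local_vars) for c in columns}
	print(dic)
f2()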

有意思的是如果是:

def g():
	# global question, context, answer
	question = "..."
	context = "..."
	answer = "..."
	columns = ["question", "context", "answer"]
	dic = {}
	for c in columns:
		dic[c] = eval(c)
	print(dic)
g()

这样就不会报错,所以循环赋值写法和直接在字典里写简易循环的作法是有区别的。


正则大写字母替换为小写字母:

import re

def replace_uppercase_with_lowercase(text):
    pattern = re.compile('[A-Z]')
    return pattern.sub(lambda x: x.group().lower(), text)

text = "HeLLo WORlD"
result = replace_uppercase_with_lowercase(text)
print(result)  # 输出: "hello world"

常规pattern.sub第一个参数repl传入目标字符串,这里之所以要先x.group(),因为repl为传入函数时,其形参x的类型是re.Match,而不是字符串,常见的re.match函数返回结果就是re.Match类型,如:

import re
# password = 'abca'
password = 'ab123ca'
if re.match(r'.*[0-9].*', password):
    print('密码格式正确!')
else:
    print('密码格式不正确!')

20240904

  • 新生入校,然而,新食堂还是没有修好,菜鸟驿站已经快炸了,快递全堆在外面,找都找不到。

  • 晚上简单摇了10圈,不是特别想跑,但是嘉伟和XR叫我下去。场上又有一个面生的高手,水平似乎不错,能跟340的配速两圈多,看起来在他的能力范围内,听嘉伟说这人今晚至少跑了有10K向上,我走的时候他都还没走,是个老手了。

  • PS:今晚看起来嘉伟最近状态还是可以的,应该是345左右间歇跑了有10K,我还以为他最近没怎么跑,掉得很厉害的。他这个月有三场小比赛,别人都不靠谱,但真是永远可以相信嘉伟的水平。

关于ACL2024接收的因果推断相关paper

  1. Causal-Guided Active Learning for Debiasing Large Language Models:2024.acl-long.778(本质就是用GPT-4去发掘其他模型的输出存在的偏差,提示工程。)
  2. Tree-of-Counterfactual Prompting for Zero-Shot Stance Detection:2024.acl-long.49(立场检测,对象是网络帖文,其目的是构造帖文形式的图文对的反事实,本质是做反事实数据增强)
  3. Identifying while Learning for Document Event Causality Identification:2024.acl-long.210(事件因果性识别(ECI)的目标是检测文档中两个事件之间是否存在因果关系)
  4. A Causal Approach for Counterfactual Reasoning in Narratives:2024.acl-long.354(也是做反事实生成,代码在https://github.com/mufeiteng/CausalCRN,这是叙事任务的反事实生成,不同于一般的指向性预测任务,重在优化事实与反事实之间的因果关系)
  5. Causal Estimation of Memorisation Profiles:2024.acl-long.834(实证研究:重在记忆化,记忆化(i)在较大模型中更强且更持久,(ii)由数据顺序和学习率决定,以及(iii)在模型大小之间具有稳定的趋势,从而使得在较小模型中可以预测较大模型的记忆化情况。)
  6. Multi-Aspect Controllable Text Generation with Disentangled Counterfactual Augmentation:2024.acl-long.500(多方面可控文本生成旨在从多个方面(例如,“情感”中的“积极”和“话题”中的“运动”)控制生成的文本。以前常做的风格迁移)
  7. DeCoT: Debiasing Chain-of-Thought for Knowledge-Intensive Tasks in Large Language Models via Causal Intervention:2024.acl-long.758(疑似是做了个CoT的可视化,从因果推断的角度讲了个CoT的故事,难绷)
  8. AGR: Reinforced Causal Agent-Guided Self-explaining Rationalization:2024.acl-short.47(根据模型当前的训练状态指导模型的下一步行动。具体来说,我们引入了因果干预计算来量化在理性化训练过程中固有的因果效应,并利用强化学习过程来调整这些效应的学习偏差。)

除了最后一篇强化学习的AGR是短文,其他都是长文,2和5是国外研究,其余来自国内(很惊讶国外也有做因果推断研究的)。

这样看下来,ACL确实不行了,水。

代码备份:

# 将triviaqa的每个问题转化为标准的JSON平铺格式(可以直接插入到dataframe中的)

import string

def trans_java_string_to_python_string(s):
    new_s = str()
    for i in range(len(s)):
        char = s[i]
        if char in string.ascii_uppercase:
            if i > 0:
                new_s += f"_{char.lower()}"
            else:
                new_s += char.lower()
        else:
            new_s += char
    return new_s

def transform_entry(entry):
    ep = entry["EntityPages"]
    q = entry["Question"]
    qid = entry["QuestionId"]
    qs = entry["QuestionSource"]
    ans = entry.get("Answer")
    sr = entry.get("SearchResults")
    qpove = entry.get("QuestionPartOfVerifiedEval")
    qvea = entry.get("QuestionVerifiedEvalAttempt")
    

    if ans is not None:
        ans_dict = {f"answer_{trans_java_string_to_python_string(_key)}": ans[_key] for _key in ans}
    else:
        ans_dict = {}
        
    ds = []
    epfn = []
    eptitle = []
    ep_url = []
    ep_ha = []
    epdpofe = []
    epdvea = []
    eplp = []
    eprho = []
    for page in ep:
        
        for key in page.keys():
            assert key in ["DocSource", "Filename", "Title",
                           "originalUrl", "HumanAnswer", "DocPartOfVerifiedEval", "DocVerifiedEvalAttempt",
                          "LinkProbability", "Rho"], f"{page.keys()}"

        ds.append(page.get("DocSource"))
        epfn.append(page.get("Filename"))
        eptitle.append(page.get("Title"))
        ep_url.append(page.get("originalUrl"))
        ep_ha.append(page.get("HumanAnswer"))
        epdpofe.append(page.get("DocPartOfVerifiedEval"))
        epdvea.append(page.get("DocVerifiedEvalAttempt"))
        eplp.append(page.get("LinkProbability"))
        eprho.append(page.get("Rho"))
        
        
    if sr is not None:
        desc = []
        srfn = []
        rank = []
        srtitle = []
        url = []  
        dpove = []
        dvea = []
        sr_ha = []
        sr_durl = []
        for result in sr:
            
            for key in result.keys():
                assert key in ["Description", "Filename", "Rank", "Title", "Url", 
                    "DocPartOfVerifiedEval", "DocVerifiedEvalAttempt", "HumanAnswer",
                              "DisplayUrl"
                              ], f"{result.keys()}"
            desc.append(result["Description"])
            srfn.append(result.get("Filename"))
            rank.append(result["Rank"])
            srtitle.append(result["Title"])
            url.append(result["Url"])
            dpove.append((result.get("DocPartOfVerifiedEval")))
            dvea.append((result.get("DocVerifiedEvalAttempt")))
            sr_ha.append((result.get("HumanAnswer")))
            sr_durl.append(result.get("DisplayUrl"))
    else:
        desc = None
        srfn = None
        rank = None
        srtitle = None
        url = None
        dpove = None
        dvea = None
        sr_ha = None
        sr_durl = None
        
    summary = {"question_id": qid,
               "question": q,

               "entity_pages_doc_source": ds,
               "entity_pages_filename": epfn,
               "entity_pages_title": eptitle,
               "entity_pages_original_url": ep_url,
               "entity_pages_human_answer": ep_ha,
               "entity_pages_doc_part_of_verified_eval": epdpofe,
               "entity_pages_doc_verified_eval_attempt": epdvea,
               "entity_pages_link_probability": eplp,
               "entity_pages_rho": eprho,
               
               "search_results_description": desc,
               "search_results_filename": srfn,
               "search_results_rank": rank,
               "search_results_title": srtitle,
               "search_results_url": url,
               "search_results_doc_part_of_verified_eval": dpove,
               "search_results_doc_verified_eval_attempt": dvea,
               "search_results_human_answer": sr_ha,
               "search_results_display_url": sr_durl,
               
               "question_source": qs,
               "Question_part_of_verified_eval": qpove,
               "Question_verified_eval_attempt": qvea,
               }
    
    summary = {**summary, **ans_dict} 
    return summary

20240905

  • 今日首蚌,新园汤面摊位上挂了个老盛昌的招牌,特显眼,上去一瞧,菜单一点儿没变,实际上浇头也确实没变,欺负新来的,搁这诈骗新生呢?(新园面条,狗都不吃

  • 晚上陪嘉伟干了几组,我九点下去,没想到他也来得这么晚,3K@347+2K@348+1.2K@410放松,他第一组比我多跑1K,后两组都比我多跑1圈。

  • 感觉不是很好跑,虽然不算太热,但状态不太行,本来也没想认真跑的,只想摇半个小时就撤,衣服鞋子都没换,跑前准备也没做,一上来就干这么猛。其实第一组没有太累,心率一直都没过170bpm,状态好一点至少5K不是太大问题,但是身上明显不太舒服。两个人一起跑,如果有一个人掉下去了,另一个人也很难坚持更久。还是我的问题,嘉伟是想扛到10K的,我掉的太快了,他也就很快不行了。养两天状态,一定要跟嘉伟认真跑一回。

  • PS:总之就是很累,感觉想实现一些事情还是很有难度的,未必可得。

TriviaQA,一共8个文件,[web, wikipedia] × [train, dev, test, dev_verified]

| 字段名 | 数据类型 | 说明 |
| --- | --- | --- |
| Answer | Dict[Aliases: List[Str], MatchedWikiEntityName: Str, NormalizedAliases: Str, NormalizedMatchedWikiEntityName: Str, NormalizedValue: Str, Type: Str, Value: Str] | 详细的答案,三个Normalized开头的字段都是对应的三个其他字段清洗后的结果,因此直接用Normalized的结果也可以了。Aliases里面是一些候选答案,Value和MatchedWikiEntityName似乎都是一样的。Type标注的是实体来自哪里,大多是WikipediaEntity |
| EntityPages | List[Dict[DocSource: Str, Filename: Str, Title: Str]] | 相关的上下文:可以在triviaqa-rc/evidence对应的web、wikipedia文件夹下找到对应的Filename;DocSource不关键,感觉是平台名称(如TagMe),Title也不关键,基本上和Filename是对应的(标题);web里可能为空列表,因为上下文来自SearchResults |
| Question | Str | 问题描述 |
| QuestionId | Str | 形如tc_69 |
| QuestionSource | Str | 不关键,问题的来源,网址,比如http://www.triviacountry/ |
| SearchResults | List[Dict[Description: Str, Filename: Str, Rank: Int, Title: Str, Url: Str]] | web类专属:1. 可以在triviaqa-rc/evidence对应的web文件夹下找到对应的Filename,形如10/10_99.txt;2. Description就是Filename对应文件的内容,只是缩水了很多;3. Rank是不明确的数值,不知道是啥;4. Title和Url就是标题和来源的网址 |
| QuestionPartOfVerifiedEval | Boolean | Verified文件专属 |
| QuestionVerifiedEvalAttempt | Boolean | Verified文件专属 |

但是后来发现EntityPages和SearchResults里其实还有更多的属性,很不干净,所以换了一种写法,能覆盖所有的属性:

class TriviaqaDataset(BaseGenerativeDataset):
	dataset_name = "TriviaQA"
	checked_data_dirs = ["./qa/web-train.json",
						 "./qa/web-dev.json",
						 "./qa/web-test-without-answers.json",
						 "./qa/verified-web-dev.json",
						 "./qa/wikipedia-train.json",
						 "./qa/wikipedia-dev.json",
						 "./qa/wikipedia-test-without-answers.json",
						 "./qa/verified-wikipedia-dev.json",
						 "./evidence/web",
						 "./evidence/wikipedia",
						 "./triviaqa-unfiltered/unfiltered-web-train.json",
						 "./triviaqa-unfiltered/unfiltered-web-dev.json",
						 "./triviaqa-unfiltered/unfiltered-web-test-without-answers.json",
						 ]
	def __init__(self,
				 data_dir,
				 ):
		super(TriviaqaDataset, self).__init__(data_dir)

	# @param batch_size: Int
	# @param type_: Str, e.g. "train", "dev", "test", "verified"
	# @param category: Str, e.g. "web", "wikipedia"
	# @param unfiltered: Boolean, only take effect when @param category is "web", then data under `./triviaqa-unfiltered` directories will be read
	# @yield batch: List[Dict]
	def yield_batch(self,
					batch_size,
					type_,
					category,
					unfiltered = False,
					):
		# Load data
		if unfiltered:
			if type_ in ["train", "dev"]:
				data_path = os.path.join(self.data_dir, f"./triviaqa-unfiltered/unfiltered-{category}-{type_}.json")
			elif type_ == "test":
				data_path = os.path.join(self.data_dir, f"./triviaqa-unfiltered/unfiltered-{category}-test-without-answers.json")
			else:
				assert False, f"Unexpected keyword argument `type_`: {type_} for unfiltered TQA!"
		else:
			if type_ == "verified":
				data_path = os.path.join(self.data_dir, f"./qa/verified-{category}-dev.json")
			elif type_ in ["train", "dev"]:
				data_path = os.path.join(self.data_dir, f"./qa/{category}-{type_}.json")
			elif type_ == "test":
				data_path = os.path.join(self.data_dir, f"./qa/{category}-test-without-answers.json")
			else:
				assert False, f"Unexpected keyword argument `type_`: {type_}"
		with open(data_path, 'r', encoding="utf8") as f:
			data = json.load(f)["Data"]
		batch, current_batch_size, = list(), 0
		for entry in data:
			normalized_entry = self._normalize_entry(entry)
			# Generate context by EntityPages
			context = list()
			entity_title = normalized_entry["entity_title"]
			entity_filename = normalized_entry["entity_filename"]
			for title, filename in zip(entity_title, entity_filename):
				file_path = os.path.join(self.data_dir, f"./evidence/{category}", filename)
				with open(file_path, 'r', encoding="utf8") as f:
					article = list(filter(None, f.read().splitlines()))
				context.append([title, article])
			answers = normalized_entry["answer_normalized_aliases"][:]	# Simply use `answer_normalized_aliases`
			batch.append({"context": context,
						  "question": normalized_entry["question"],
						  "answers": answers,
						  })
			current_batch_size += 1
			if current_batch_size == batch_size:
				# self.check_batch_data_keys(batch)
				yield batch
				batch, current_batch_size, = list(), 0
		if current_batch_size > 0:
			# self.check_batch_data_keys(batch)
			yield batch

	# Normalize a single entry of TriviaQA data
	# @param entry: Dict, A single QA-sample in JSON format of TriviaQA
	# @return normalized_entry: Dict, Normalized QA-sample in JSON format
	def _normalize_entry(self, entry):
		normalized_columns = ["question", "question_id", "question_source"]
		# Extract raw data
		entity_pages = entry["EntityPages"]
		question = entry["Question"]
		question_id = entry["QuestionId"]
		question_source = entry["QuestionSource"]
		answer = entry.get("Answer")
		search_results = entry.get("SearchResults")
		question_part_of_verified_eval = entry.get("QuestionPartOfVerifiedEval")
		question_verified_eval_attempt = entry.get("QuestionVerifiedEvalAttempt")
		# Normalize the different dictionary-like fields
		answer_dict = self._normalize_dict_data(data=answer, prefix="answer")	# Normalize Answer
		entity_pages_dict = self._normalize_list_of_dicts_data(data=entity_pages, prefix="entity_pages")	# Normalize EntityPages
		search_results_dict = self._normalize_list_of_dicts_data(data=search_results, prefix="search_results")	# Normalize SearchResults
		# Combine the normalized data
		# normalized_entry = {column: eval(column) for column in normalized_columns}	# Error! Local variables (question, question_id, question_source) are not defined
		normalized_entry = dict()
		for column in normalized_columns:
			normalized_entry[column] = eval(column)
		normalized_entry = {**normalized_entry, **answer_dict, **entity_pages_dict, **search_results_dict}
		return normalized_entry
		
	# Normalize Dict-like data, e.g. Answer
	# @param data: Dict-like variable
	# @param prefix: Normalize key name by adding prefix, i.e. "answer" for Answer
	# @return normalized_dict: Dict[Obj]
	def _normalize_dict_data(self, data, prefix):
		normalized_dict = dict()
		if data is not None:
			for key, value in data.items():
				print(key, value)
				normalized_key = f"{prefix}_{self._transform_camel_to_underscore(key)}"
				normalized_dict[normalized_key] = value
		return normalized_dict

	# Normalize List[Dict]-like data, e.g. EntityPages and SearchResults
	# @param data: List[Dict]-like variable
	# @param prefix: Normalize key name by adding prefix, i.e. "entity_pages" for EntityPages and "search_results" for SearchResults
	# @return normalized_dict: Dict[List[Obj]]
	def _normalize_list_of_dicts_data(self, data, prefix):
		normalized_dict = dict()
		if data is not None:
			for i, datum in enumerate(data):	
				for key, value in datum.items():
					normalized_key = f"{prefix}_{self._transform_camel_to_underscore(key)}"
					if normalized_key in normalized_dict:
						normalized_dict[normalized_key].append(value)
					else:
						# Note that if i > 0, then it means that `datum` in `data` has different keys
						normalized_dict[normalized_key] = [None] * i + [value]
						logging.warning(f"New key occurs: {normalized_key}")
		return normalized_dict

	# Transform UpperCamelCase string to lower_case_with_underscores
	# @param string: String in UpperCamelCase format, e.g. QuestionPartOfVerifiedEval
	def _transform_camel_to_underscore(self, string):
		return string[0].lower() + re.sub("[A-Z]", lambda _match: f"_{_match.group().lower()}", string[1:])
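把驼峰转下划线的那个小正则单独拎出来验证一下(不依赖类实例的示意):

import re

def camel_to_underscore(string):
	return string[0].lower() + re.sub("[A-Z]", lambda _match: f"_{_match.group().lower()}", string[1:])

print(camel_to_underscore("QuestionPartOfVerifiedEval"))  # question_part_of_verified_eval
print(camel_to_underscore("DocSource"))                   # doc_source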

20240906

  • 新食堂修了两个多月,就做了两件事,把一楼窗口玻璃拆了,然后挂了个意义不明的肯德基老爷爷的巨幅肖像画。没去仔细看一楼到底是不是真的改卖肯德基了,但是肯定大部分还是原班人马,但人家新园挂个老盛昌好歹还是面条,我就不信新食堂一楼真改卖炸鸡薯条。

  • 晚上九点跟嘉伟认真跑了一下(去晚一点人少,舒服些,赤膊没啥心理负担),我提议由我来带,怕他像昨天一样带太快,计划是四分起步,顶个10K。结果我带的比昨天还快,倒反天罡,把嘉伟给干爆了。匀速跑4K@3’40"+渐加速2K@3’55"放松

  • 节奏极好,这可能是今年节奏最好的一次,2K时余力尚足,3K时稍有不支,但状态很好,感觉可以冲击5000米PB,但是连续两圈分别被足球和小P孩绊了一次,节奏稍乱,我不想以此为借口,但第十圈时确实感觉心肺顶不住了。我跟嘉伟说可能这是最后一圈了,潜台词是想他要是能上来带就给我拉住,结果半圈之后,嘉伟居然掉下去了(他确实减速了),巨难绷,我看他掉下去后便提速,很快力竭,最终嘉伟独自顶到5K,用时18分40秒,他是放了不少,这个配速是他的乳酸阈,对他应该不算太难,可能还是状态不好。

  • PS:虽然跑崩,但今天跑得是比较满意的,确实要跟水平相当的一起练更有效果,独自跑硬顶很难。表现来看基本持平3月,但目前天气比3月要热得多,等到天气凉快下来,5000米完全有可能冲击18分以内。

此后roberta-large-finetuned-race及longformer-large-4096-answering-race的输入处理以此为准:


class MultipleChoiceDataset(BaseDataset):
	dataset_name = "Multiple-choice"
	batch_data_keys = ["article",	# Str, usually
					   "question",	# Str
					   "options",	# List[Str]
					   "answer",	# Int
					   ]
	def __init__(self, data_dir, **kwargs):
		super(MultipleChoiceDataset, self).__init__(data_dir, **kwargs)

	# Generate inputs for different models
	# @param batch: @yield of function `yield_batch`
	# @param tokenizer: Tokenizer object
	# @param model_name: See `model_name` of CLASS defined in `src.models.multiple_choice`
	@classmethod
	def generate_model_inputs(cls,
							  batch,
							  tokenizer,
							  model_name,
							  **kwargs,
							  ):
		if model_name == "LIAMF-USP/roberta-large-finetuned-race":
			# Unpack keyword arguments
			max_length = kwargs.get("max_length", 512)
			# Generate batch inputs
			batch_inputs = list()
			for data in batch:
				# Unpack data
				article = data["article"]
				question = data["question"]
				option = data["options"]
				flag = question.find('_') == -1
				choice_inputs = list()
				for choice in option:
					question_choice = question + ' ' + choice if flag else question.replace('_', choice)
					inputs = tokenizer(article,
									   question_choice,
									   add_special_tokens = True,
									   max_length = max_length,
									   padding = "max_length",
									   truncation = True,
									   return_overflowing_tokens = False,
									   return_tensors = None,	# return list instead of pytorch tensor, for concatenation
									   ) 			# (1, max_length)
					choice_inputs.append(inputs)	# (n_option, 1, max_length)
				batch_inputs.append(choice_inputs)	# (batch_size, n_option, 1, max_length)
			# InputIds and AttentionMask
			input_ids = torch.LongTensor([[inputs["input_ids"] for inputs in choice_inputs] for choice_inputs in batch_inputs])
			attention_mask = torch.LongTensor([[inputs["attention_mask"] for inputs in choice_inputs] for choice_inputs in batch_inputs])
			model_inputs = {"input_ids": input_ids,	# (batch_size, n_option, max_length)
							"attention_mask": attention_mask,	# (batch_size, n_option, max_length)
							}
		elif model_name == "potsawee/longformer-large-4096-answering-race":
			# Unpack keyword arguments
			max_length = kwargs["max_length"]
			# Generate batch inputs
			batch_inputs = list()
			for data in batch:
				# Unpack data
				article = data["article"]
				question = data["question"]
				option = data["options"]
				article_question = [f"{question} {tokenizer.bos_token} {article}"] * 4  # 拼接的是 article 变量本身,而不是字面量
				# Tokenization
				inputs = tokenizer(article_question,
								   option,
								   max_length = max_length,
								   padding = "max_length",
								   truncation = True,
								   return_tensors = "pt",
								   ) # (, max_length)
				batch_inputs.append(inputs)
			# InputIds and AttentionMask
			input_ids = torch.cat([inputs["input_ids"].unsqueeze(0) for inputs in batch_inputs], axis=0)
			attention_mask = torch.cat([inputs["attention_mask"].unsqueeze(0) for inputs in batch_inputs], axis=0)
			model_inputs = {"input_ids": input_ids,	# (batch_size, n_option, max_length)
							"attention_mask": attention_mask,	# (batch_size, n_option, max_length)
							}
		else:
			raise NotImplementedError(model_name)
		return model_inputs

20240907

  • 集群炸了,No space left on device,不知道哪个好大儿搞了个大的,现在大家都写不了文件。

  • 周六日常回血,晚上力量训练,30箭步×8组(+20kg),组间50次双脚提踵(+20kg),结束慢跑三圈放松,嘉伟独自跑了几组间歇,看起来状态不是很好,不过AX倒是很稳目前,今晚10组×800米@345,一个人顶完很不错的。

  • PS:时隔两个多月,SXY再次上山,不过并不是很舒服,似乎失去了当初的热情。其实游泳很舒服,最近上海到处都在推销游泳馆,趁着新生入学薅一波羊毛了属于是,这种天气大太阳晒得,无论跑步还是徒步都很找罪受。

之前面试被问过为什么多分类问题用CE而非MSE作为loss,当时的回答是凹凸性,其实当时就觉得有问题:就预测出的概率分布而言,两者都是凸的,凸性并不是关键原因。事实上是CE比MSE更容易被目前主流的优化器(如AdamW)优化,下面的可视化结果的确说明了这一点(本质上CE等价于MLE,可以由Entropy的定义和KL散度推出):

$$\begin{aligned} l_{\text{MSE}}&=\sum_i(y_i-y_i')^2\\ l_{\text{CE}}&=-\sum_i y_i\log y_i' \end{aligned}$$

import numpy as np

def softmax(x):
	exp_x = np.exp(x - np.max(x)) # 为了数值稳定性,减去输入的最大值
	return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

def mse(gt, logits, epsilon=1e-7):
	probs = softmax(logits)
	# probs = np.clip(probs, epsilon, 1. - epsilon)
	return np.sum((gt - probs) ** 2)

def ce(gt, logits, epsilon=1e-7):
	probs = softmax(logits)
	probs = np.clip(probs, epsilon, 1. - epsilon)
	return - np.sum(gt * np.log(probs))

y = np.asarray([1, 0, 0])
logits = np.asarray([-5, 5, -1000])
softmax(logits) # 显然倾向于分类到第1类

在上面这个例子中,mse(y, logits)ce(y, logits)分别为1.9998和10.0000

下面我们固定第三维logit是-1000,然后让第一维和第二维logit从-10到10,然后我们来看看损失的变化情况:

import matplotlib.pyplot as plt

# 创建网格
y = np.asarray([1, 0, 0])
logits1 = np.linspace(-10, 10, 100)
logits2 = np.linspace(-10, 10, 100)
X, Y = np.meshgrid(logits1, logits2)

# 计算MSE和CE
Z_mse = np.zeros_like(X)
Z_ce = np.zeros_like(X)

for i in range(X.shape[0]):
	for j in range(X.shape[1]):
		logits = np.array([X[i, j], Y[i, j], -1000])
		Z_mse[i, j] = mse(y, logits)
		Z_ce[i, j] = ce(y, logits)

# 绘制热力图
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))
im1 = ax1.imshow(Z_mse,
				 extent = [-10, 10, -10, 10],
				 origin = "lower",
				 aspect = "auto",
				 cmap = "rainbow",
				 # cmap = "viridis",
				 )
ax1.set_title("MSE Loss")
ax1.set_xlabel("logits1")
ax1.set_ylabel("logits2")
plt.colorbar(im1, ax = ax1)


im2 = ax2.imshow(Z_ce,
				 extent = [-10, 10, -10, 10],
				 origin = "lower",
				 aspect = "auto",
				 cmap = "rainbow",
				 # cmap = "viridis",
				 )
ax2.set_title("Cross-Entropy Loss")
ax2.set_xlabel("logits1")
ax2.set_ylabel("logits2")
plt.colorbar(im2, ax = ax2)

plt.tight_layout()
plt.show()


这意味着什么?我们最终希望loss减小到接近0(在这个例子里MSE和CE的最小值都在0附近),对应的正是热力图的右下角(logits1大、logits2小),也就是说优化要走到右下角。那么初始点如果落在左上角就有问题了:对MSE来说,左上角几乎是一片平坦,loss基本没有变化,梯度不够显著,很难找到正确的下降方向;反观CE,整个平面上损失的变化都很明显,优化方向一目了然。这才是多分类用CE而非MSE的根本原因,跟凸性关系不大,不过据说MSE和CE的偏导等价,不是很懂那些人在说啥。
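可以用数值梯度粗略验证一下(沿用上面的mse、ce和y,取一个落在左上角的初始点,用中心差分估一下对logits1的偏导,纯示意):

def num_grad(loss_fn, gt, logits, idx, eps=1e-4):
	# 对第 idx 个 logit 做中心差分,估计偏导数
	lo, hi = logits.copy(), logits.copy()
	lo[idx] -= eps
	hi[idx] += eps
	return (loss_fn(gt, hi) - loss_fn(gt, lo)) / (2 * eps)

bad_logits = np.asarray([-8.0, 8.0, -1000.0])  # 正确类的 logit 很低,相当于热力图的左上角
print(num_grad(mse, y, bad_logits, idx=0))  # 绝对值在 1e-7 量级,MSE 几乎给不出下降方向
print(num_grad(ce, y, bad_logits, idx=0))   # 约 -1,CE 的梯度依然很显著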


20240908

  • 中午十一点半的食堂,被新生支配的恐惧。

  • 今年筹备给wyl买教师节礼物,想了好久,这五六年里,什么按摩仪、手环、枕头、电动牙刷,以及各种保健品都买遍了,最后搞了个文创杯具,甭管它好不好用,好看就完事了。

  • 晚上计划跟AK跑长距离,本来6点多沙赛队伍要在操场进行接力赛,但是AK临时有会鸽了,我也懒得下去凑热闹,九点等人少了简单遛了5K多,跑前吃了几片辣豆皮,不是很舒服,不过随便跑跑也有随便跑跑的好处,上强度?上个鸟。目前九月跑量45K,均配4’06",下周末就要出去一周,这个月怕是要摸了。

小坑,关于transformers的question-answering的pipeline的运行机制:

以deepset/roberta-base-squad2为例:

from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering
from settings import MODEL_SUMMARY
context = 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'
question = 'When did Beyonce start becoming popular?'
model_path = MODEL_SUMMARY["deepset/roberta-base-squad2"]["path"]
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForQuestionAnswering.from_pretrained(model_path)
inputs = dict(context = context, question = question)
pipe = pipeline("question-answering", model = model, tokenizer = tokenizer)
outputs = pipe(inputs)
print(outputs)

具体应当如何把context和question输入到tokenizer里?


	# Generate inputs for different models
	# @param batch: @yield of function `yield_batch`
	# @param tokenizer: Tokenizer object
	# @param model_name: See `model_name` of CLASS defined in `src.models.extractive`	
	@classmethod
	def generate_model_inputs(cls,
							  batch,
							  tokenizer,
							  model_name,
							  **kwargs,
							  ):
		if model_name == "deepset/roberta-base-squad2":
			# Unpack keyword arguments
			max_length = kwargs.get("max_length", 512)
			# Generate batch inputs
			batch_inputs = list()
			contexts = list()
			questions = list()
			for data in batch:
				context = str()
				for title, sentences in data["context"]:
					# context += title + '\n'
					context += '\n'.join(sentences) + '\n'
				contexts.append(context)
				questions.append(data["question"])
			model_inputs = tokenizer(questions,
									 contexts,
									 add_special_tokens = True,
									 max_length = max_length,
									 padding = "max_length",
									 truncation = True,
									 return_overflowing_tokens = False,
									 return_tensors = "pt",
									 ) 	# Dict[input_ids: Tensor(batch_size, max_length),
										#	   attention_mask: Tensor(batch_size, max_length)]
		else:
			raise NotImplementedError(model_name)
		return model_inputs

验证结果表明,必须是先question,后context的顺序,而且这不是这一个模型这样,这就是pipeline内置的逻辑,反过来模型可以输出(本质上就是拼起来了),但是结果出错。这里有个问题,就是SQuAD比较特殊,context都只有一段话(即单文档),这样的话对于roberta来说,就是用</s>来分割question和context,如果是多文档,可能还是得全部拼起来,而不能是用</s>分割,否则模型无法识别question和context所属的部分,而不能给出正确的解答。

不过后来扒源码还是看到了,在./transformers/pipelines/question_answering.pyQuestionAnsweringPipeline.preprocess方法中:

    def preprocess(self, example, padding="do_not_pad", doc_stride=None, max_question_len=64, max_seq_len=None):
        # XXX: This is specal, args_parser will not handle anything generator or dataset like
        # For those we expect user to send a simple valid example either directly as a SquadExample or simple dict.
        # So we still need a little sanitation here.
        if isinstance(example, dict):
            example = SquadExample(None, example["question"], example["context"], None, None, None)

        if max_seq_len is None:
            max_seq_len = min(self.tokenizer.model_max_length, 384)
        if doc_stride is None:
            doc_stride = min(max_seq_len // 2, 128)

        if doc_stride > max_seq_len:
            raise ValueError(f"`doc_stride` ({doc_stride}) is larger than `max_seq_len` ({max_seq_len})")

        if not self.tokenizer.is_fast:
            features = squad_convert_examples_to_features(
                examples=[example],
                tokenizer=self.tokenizer,
                max_seq_length=max_seq_len,
                doc_stride=doc_stride,
                max_query_length=max_question_len,
                padding_strategy=PaddingStrategy.MAX_LENGTH,
                is_training=False,
                tqdm_enabled=False,
            )
        else:
            # Define the side we want to truncate / pad and the text/pair sorting
            question_first = self.tokenizer.padding_side == "right"

            encoded_inputs = self.tokenizer(
                text=example.question_text if question_first else example.context_text,
                text_pair=example.context_text if question_first else example.question_text,
                padding=padding,
                truncation="only_second" if question_first else "only_first",
                max_length=max_seq_len,
                stride=doc_stride,
                return_token_type_ids=True,
                return_overflowing_tokens=True,
                return_offsets_mapping=True,
                return_special_tokens_mask=True,
            )
            # When the input is too long, it's converted in a batch of inputs with overflowing tokens
            # and a stride of overlap between the inputs. If a batch of inputs is given, a special output
            # "overflow_to_sample_mapping" indicate which member of the encoded batch belong to which original batch sample.
            # Here we tokenize examples one-by-one so we don't need to use "overflow_to_sample_mapping".
            # "num_span" is the number of output samples generated from the overflowing tokens.
            num_spans = len(encoded_inputs["input_ids"])

            # p_mask: mask with 1 for token than cannot be in the answer (0 for token which can be in an answer)
            # We put 0 on the tokens from the context and 1 everywhere else (question and special tokens)
            p_mask = [
                [tok != 1 if question_first else 0 for tok in encoded_inputs.sequence_ids(span_id)]
                for span_id in range(num_spans)
            ]

注意看encoded_inputs的定义,其实参数顺序和self.tokenizer.padding_side有关,即到底在左侧padding还是在右侧padding,当然默认是右侧padding,所以是question在前,context在后


20240909

  • 跟房东视频认识了一下,口语确是硬伤,嘴巴跟不上脑袋,虽然也是好久不用英文交流,不过确实是也有点听不懂瑞士人的英语,感觉有点方言的味道。感觉房东还是很热情的,会是一趟好的旅行。

  • 雨夜湿闷,硬刚嘉伟,他是2K@340-345×4组,间歇5分钟,不是很舒服。鞋子拿错,穿的361飞飙,鞋底不稳,前掌蹬地太猛,右脚踝稍有异样,最终9K出头,不想非要凑到10K,还是稳一点为好,今晚本不打算上这么大强度,中间陪YY,LZR分别慢跑了几小段,感觉他俩还是很稳的,基本上是410左右的配,看起来并不算吃力,应该等天气凉快万米跑进40分钟不是大问题。

关于deepspeed的存储优化

  • ds_report (cli)
  • https://huggingface.co/docs/accelerate/usage_guides/deepspeed
  • activation:激活值
    • backward 过程中使用链式法则计算梯度时会用到。有了它算梯度会更快,但它不是必须存储的,因为可以通过重新做Forward来算它(gradient checkpointing)。
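在transformers里打开这种重算(gradient checkpointing)大致就是下面两行(示意,model沿用前文加载好的CausalLM):

model.gradient_checkpointing_enable()  # forward 不保留中间激活,backward 时重算,省显存换时间
model.config.use_cache = False         # KV cache 与 checkpointing 冲突,训练时关掉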

offload

  • ZeRO-Offload uses DeepSpeed’s highly optimized CPU implementation of Adam called DeepSpeedCPUAdam.
    • https://github.com/microsoft/DeepSpeed/tree/master/deepspeed/ops/adam
  • offload_{optimizer|param}:可以比较有效地缓解 gpu 显存的压力(Only applicable with ZeRO >= Stage-2.)
    • none
    • cpu: cpu memory
    • nvme:

混合精度:

下面按那张经典的说明模型并行和数据并行如何优化内存的图的口径,列一下混合精度训练时每个参数对应的显存构成:

  • OS: 12, P/G: 2;
    • 中间值
      • Parameter: fp16, 2
      • Gradient: fp16, 2
    • 必存(OS,Adam optimizer 优化器有关):
      • parameter: fp32, 4
      • momentum: fp32, 4
      • variance: fp32, 4
  • zero_stage:
    • [0] Disabled,
    • [1] OS, optimizer state partitioning,
    • [2] OS+G, optimizer+gradient state partitioning and
    • [3] OS+G+P, optimizer+gradient+parameter partitioning
from accelerate import DeepSpeedPlugin
  • DeepSpeedPlugin 参数配置
    • gradient_accumulation_steps: int = None,
    • gradient_clipping: float = None,
      • 1.0
    • zero_stage: 2
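按上面这几个参数拼一个最小的用法sketch(纯示意:ZeRO stage 2 + 优化器状态offload到cpu,混合精度交给Accelerator的mixed_precision)。粗略估算一下必要性:按每参数2+2+12=16字节计,7B模型不做任何分片时,光这部分就约112GB,单卡显然放不下:

from accelerate import Accelerator, DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    zero_stage=2,                    # OS+G 分片
    gradient_accumulation_steps=8,
    gradient_clipping=1.0,
    offload_optimizer_device="cpu",  # ZeRO >= 2 时把优化器状态放到 cpu 内存
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)
# 之后照常 model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)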

running exceptions

  • cannot find -lcurand and -lcudart

    • https://github.com/microsoft/DeepSpeed/issues/3929
    cd /home/asdf/.local/lib/python3.10/site-packages/torch/lib
    ln -s /usr/local/cuda/lib64/libcurand.so .
    
  • accelerate config default

    • distributed_type: MULTI_GPU
    • $HF_HOME/accelerate/default_config.yaml
  • accelerate env:打印运行环境

  • accelerate test:环境测试;


20240910

  • 罕见的漫长秋雨季,不过总算是凉快不少。也挺好,雨停时刚好我也回来了。

  • 上马要到20号左右才有消息,今天先抽个10.27的问泰安世10km精英赛,虽然10K没有什么意义,但这是为数不多地能和嘉伟同台竞技的机会。这辈子估计是没机会在5000米以下的距离上接近嘉伟的PB,不过10km还是有点机会的。(又怕会受伤,唉,那样首马就又没了)

  • 晚上雨依然很大,到九点多下去看了一眼,不是能跑的样子,遂实验楼B1-15F上下×5次,间歇为100次台阶跳步,感觉很好,有半年没有爬楼梯了。

https://huggingface.co/gaussalgo/T5-LM-Large_Canard-HotpotQA-rephrase

import datasets
canard_train_augm = datasets.load_dataset("gaussalgo/Canard_Wiki-augmented", split="train")  # see the dataset card for details
canard_test_augm = datasets.load_dataset("gaussalgo/Canard_Wiki-augmented", split="test")

canard_df = canard_train_augm.to_pandas()
canard_test_df = canard_test_augm.to_pandas()

### Curation of seq2seq input contexts and labels
import random

def input_context_from_sample(row: dict, max_length=5) -> str:
    context = "Previous conversation:"
    context += "\nQuestion: "
    context += ", ".join(row["History"][:3])
    for i in range(3, len(row["History"]), 2):
        context += "\nAnswer: "
        context += row["History"][i]
        if i+1 < len(row["History"]):
            context += "\nQuestion: "
            context += row["History"][i+1]

    context += "\n\nCurrent Question: "
    context += row["Question"]

    context += "\nSearch results:"
    all_contexts = row["retrieved_contexts"].tolist()[:max_length-1] + [row["true_contexts"]]
    random.shuffle(all_contexts)

    for i, search_result in enumerate(all_contexts):
        context += "\n[%s]: " % (i+1)
        context += search_result.replace("CANNOTANSWER", "")

    context += "\nCurrent Answer: "
    return context


def rephrasing_context_from_sample(row: dict) -> str:
    context = "Previous conversation:"
    context += "\nQuestion: "
    context += ", ".join(row["History"][:3])
    for i in range(3, len(row["History"]), 2):
        context += "\nAnswer: "
        context += row["History"][i]
        if i+1 < len(row["History"]):
            context += "\nQuestion: "
            context += row["History"][i+1]
    
    context += "\n\nCurrent Question: "
    context += row["Question"]

    context += "\nMore specific question: "
    return context


def hotpotqa_context(row: dict) -> str:
    context = "Current Question: "
    context += row["question"]

    context += "\nSearch results:"
    all_contexts = [" ".join(context) for context in row["context"]["sentences"]]

    for i, search_result in enumerate(all_contexts):
        context += "\n[%s]: " % (i+1)
        # context += search_result.replace("CANNOTANSWER", "")

    context += "\nCurrent Answer: "
    return context


input_texts = canard_df.apply(lambda row: input_context_from_sample(row), axis=1).values
input_val_texts = canard_test_df.iloc[:200].apply(lambda row: input_context_from_sample(row), axis=1).values

too_long_index = [len(t) > 20000 for t in input_texts]
input_texts = [t for i, t in enumerate(input_texts) if not too_long_index[i]]
print("training on %s samples" % len(input_texts))

labels = canard_df.answer.apply(lambda ans: "No answer" if ans == "CANNOTANSWER" else ans).values
labels = [l for i, l in enumerate(labels)  if not too_long_index[i]]

val_labels = canard_test_df.answer.apply(lambda ans: "No answer" if ans == "CANNOTANSWER" else ans).values

rephrasing_inputs = canard_df.apply(lambda row: rephrasing_context_from_sample(row), axis=1).values
print(rephrasing_inputs[0])

rephrasing_val_inputs = canard_test_df.apply(lambda row: rephrasing_context_from_sample(row), axis=1).values

rephrasing_labels = canard_df.Rewrite.values
rephrasing_val_labels = canard_test_df.Rewrite.values
print(rephrasing_labels[0])

# Training
# see Adaptor's homepage for details:
# https://github.com/gaussalgo/adaptor

from adaptor.lang_module import LangModule

lang_module = LangModule("google/t5-large-lm-adapt")

from adaptor.evaluators.generative import ROUGE, BLEU

evaluators = [BLEU(), ROUGE()]

from adaptor.objectives.seq2seq import Sequence2Sequence

seq_qa = Sequence2Sequence(lang_module,
                           texts_or_path=input_texts,
                           labels_or_path=labels,
                           val_texts_or_path=input_val_texts,
                           val_labels_or_path=val_labels,
                           batch_size=4,
                           val_evaluators=evaluators,
                           objective_id="Canard")

hotpot_train = datasets.load_dataset("hotpot_qa", "distractor")["train"]
hotpot_val = datasets.load_dataset("hotpot_qa", "distractor")["validation"]

hotpot_inputs = hotpot_train.to_pandas().apply(hotpotqa_context, axis=1)
hotpot_val_inputs = hotpot_val.to_pandas().apply(hotpotqa_context, axis=1)

too_long_index = [len(t) > 20000 for t in hotpot_inputs]

hotpot_inputs = [t for i, t in enumerate(hotpot_inputs) if not too_long_index[i]]
hotpot_answers = [t for i, t in enumerate(hotpot_train["answer"]) if not too_long_index[i]]

seq_additional_qa = Sequence2Sequence(lang_module,
                                      texts_or_path=hotpot_inputs,
                                      labels_or_path=hotpot_answers,
                                      val_texts_or_path=hotpot_val_inputs[:200],
                                      val_labels_or_path=hotpot_val["answer"][:200],
                                      batch_size=4,
                                      val_evaluators=evaluators,
                                      objective_id="HotpotQA",
                                      share_other_objective_head=seq_qa)


seq_rephrasing = Sequence2Sequence(lang_module,
                                   texts_or_path=rephrasing_inputs,
                                   labels_or_path=rephrasing_labels,
                                   val_texts_or_path=rephrasing_val_inputs[:200],
                                   val_labels_or_path=rephrasing_val_labels[:200],
                                   batch_size=4,
                                   val_evaluators=evaluators,
                                   objective_id="rephrasing",
                                   share_other_objective_head=seq_qa)

from adaptor.utils import AdaptationArguments, StoppingStrategy

training_arguments = AdaptationArguments(output_dir="checkpoints-chatbot",
                                         learning_rate=5e-5,
                                         stopping_strategy=StoppingStrategy.ALL_OBJECTIVES_CONVERGED,
                                         stopping_patience=8,
                                         save_total_limit=8,
                                         do_train=True,
                                         do_eval=True,
                                         bf16=True,
                                         warmup_steps=1000,
                                         gradient_accumulation_steps=8,
                                         logging_steps=10,
                                         eval_steps=200,
                                         save_steps=1000,
                                         num_train_epochs=10,
                                         evaluation_strategy="steps")

from adaptor.schedules import ParallelSchedule
from adaptor.adapter import Adapter

schedule = ParallelSchedule(objectives=[seq_qa, seq_additional_qa, seq_rephrasing],
                            args=training_arguments)
adapter = Adapter(lang_module, schedule, args=training_arguments)

adapter.train()
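
Once training finishes, the checkpoints written to checkpoints-chatbot are regular Hugging Face seq2seq checkpoints, so inference can presumably be run with plain transformers; the exact checkpoint subdirectory below is a placeholder, point it at whichever saved step/objective you want to evaluate:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

ckpt = "checkpoints-chatbot/Canard"  # hypothetical path: replace with an actual saved checkpoint dir
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

prompt = input_val_texts[0]  # reuse one of the validation prompts built above
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))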

20240911

  • Bad news: a typhoon may hit on the 15th.

  • Seems I've put on a bit of weight; my shorts feel tight around a small belly. Probably from eating too well lately, or maybe from the other night when a craving hit and I ate six slices of buttered bread in one go (once I start I can't stop). Not a big deal; on campus food I won't get very fat anyway.

  • Easy half-hour jog in the evening @7K. Ran without socks, which wasn't great, and the campus track was clearly too crowded, so I went outside partway through. With this weather nobody has been training much lately, though Xiao Cui is probably grinding away every day at Jiangwan Stadium; at this time last year he was stronger than I am now, and if he trains back to that level he'll be a real asset.

  • SXY pushed too hard at Luwan tonight: four 1200m intervals with heart-rate recovery, averaging 175 bpm with a max of 196 bpm. Too hasty.

PEFT: LoRA vs. whole-model fine-tuning

import peft
from peft import LoraConfig, get_peft_model
import os
import torch
import numpy as np
from tqdm import tqdm
from torch import nn
from torch import optim
from torch.utils.data import Dataset, DataLoader
import torchvision
from torchvision import transforms
from torchvision.transforms import Resize
from torchvision.models import resnet152

os.environ['http_proxy'] = 'http://127.0.0.1:7890'
os.environ['https_proxy'] = 'http://127.0.0.1:7890'

DEVICE = "cuda:0"
normalize = transforms.Compose(
    [
        transforms.Resize(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ]
)
train_set = torchvision.datasets.CIFAR10(
    root="./data", 
    train=True, 
#     download=True, 
    transform=normalize
)
train_loader = DataLoader(train_set, batch_size=128, shuffle=False, num_workers=2)

test_set = torchvision.datasets.CIFAR10(
    root='./data', 
    train=False, 
#     download=True, 
    transform=normalize)
test_loader = torch.utils.data.DataLoader(
    test_set, batch_size=128, shuffle=False, num_workers=2)
criterion = nn.CrossEntropyLoss()

def train(net, train_loader, lr=1e-3, epochs=20):
    trainable_para = []
    for p in net.parameters():
        if p.requires_grad:
            trainable_para.append(p)
            
    print("num of trainable parameters: ", sum(p.numel() for p in trainable_para if p.requires_grad))
    optimizer = optim.Adam(trainable_para, lr=lr)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    net.train()
    
    for epoch in range(epochs):
        net.train()
        for inputs, targets in tqdm(train_loader):
            inputs, targets = inputs.to(DEVICE), targets.to(DEVICE)
            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
        if (epoch+1) % 2 == 0:
            test(net)
    net.eval()

def test(net):
    net.eval()
    test_loss = 0
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(test_loader):
            inputs, targets = inputs.to(DEVICE), targets.to(DEVICE)
            outputs = net(inputs)
            loss = criterion(outputs, targets)

            test_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

    print( 'Loss: %.3f | Acc: %.3f%% (%d/%d)' % (test_loss/len(test_loader), 100.*correct/total, correct, total))

First, whole-model training:

model = resnet152(weights='DEFAULT')
in_features = model.fc.in_features
model.fc = nn.Linear(in_features, 10)
model.to(DEVICE)
train_loader = DataLoader(train_set, batch_size=128, shuffle=False, num_workers=2)
train(model, train_loader, lr=3e-4)

Then LoRA:

target_modules = []
available_types = [torch.nn.modules.conv.Conv2d, torch.nn.modules.linear.Linear]
for n, m in model.named_modules():
    if type(m) in available_types:
        target_modules.append(n)
target_modules.remove('fc')

config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",# 'none', 'all' or 'lora_only'
    target_modules=target_modules,
    modules_to_save=["fc"],
)
peft_model = get_peft_model(model, config).to(DEVICE)
peft_model.print_trainable_parameters()

train_loader = DataLoader(train_set, batch_size=1024, shuffle=False, num_workers=2)
model = resnet152(weights=None, num_classes=10)
model.to(DEVICE)
peft_model = get_peft_model(model, config).to(DEVICE)
peft_model.print_trainable_parameters()
train(peft_model, train_loader)

train_loader = DataLoader(train_set, batch_size=2048, shuffle=False, num_workers=2)
model = resnet152(weights=None, num_classes=10)
model.to(DEVICE)
peft_model = get_peft_model(model, config).to(DEVICE)
peft_model.print_trainable_parameters()
train(peft_model, train_loader)
test(peft_model)
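
One extra step worth knowing (not part of the comparison above): once a LoRA run is done, the adapter deltas can be folded back into the base weights with peft's merge_and_unload, so deployment needs no peft wrapper at all. A minimal sketch:

# Fold the LoRA updates into the base ResNet weights; behaviour should be unchanged.
merged_model = peft_model.merge_and_unload()
merged_model.eval()
test(merged_model)  # expected to match the accuracy of the wrapped peft_model

# The merged model is now an ordinary torchvision ResNet and can be saved as such.
torch.save(merged_model.state_dict(), "resnet152_cifar10_lora_merged.pt")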

20240912

  • First regular team practice of the semester in the evening; the incoming freshmen are now eight cohorts below me. Jiawei didn't show up; when I asked why he wouldn't come train for a bit he was evasive, and my guess is he simply didn't want to run into a certain someone.

  • I had planned to run seriously and brought the XTEP 160X 3.0 PRO, but after the hard session with Jiawei on September 6 I went another week without carbon-plate shoes and no longer feel comfortable with how they land (this pair really only works with a forefoot strike). The rhythm that day was genuinely good; today I felt flat from the start and unusually heavy (have I really gained weight?), both sluggish and unstable. I still scraped together a bit over 10K, barely counting as making up volume, but the efficiency was poor.

  • The slow 0.73K in the middle was pacing a freshman who said he runs regularly and wants to join the track team. I was glad: fresh blood is always a good thing. I eased him from 5:00/km down to 4:30/km and he cracked in under two laps. Green, sure, but didn't I climb up from exactly that level myself? And he's a freshman; no point being too demanding.

  • Over the first 12 days of September I've logged 72K (with one rest day) at an average pace just over 4'04"/km. Honestly, any pace starting with a 4 is an easy run for me now. Training alone, I don't want to go all out, can hardly go all out, and am even afraid to. At the start, who didn't push to the limit every single time before stopping? Getting injured taught me respect, but it also made me hungrier for the win, even at the cost of breaking myself. I'm not among the fastest runners, but for someone running under 200K a month, reaching this level is enough to be content with.

  • PS: I keep wanting novice runners to understand this, but no lesson cuts deeper than pain.

Notes on speculative decoding:

  • Generation strategies: Hung-yi Lee's slides on decoding strategies
    • https://drive.google.com/file/d/1Ac3oFUtq6ThokrMvB7VUfBCUFsoMPba-/view
  • References
    • https://github.com/lucidrains/speculative-decoding/tree/main
    • https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/

  • Speculative sampling: guess first, then verify (a minimal sketch of the accept/verify loop follows this list)
  • Approximation (draft) model vs. target model
    • The draft model acts as an "oracle": small and fast;
      • it can cheaply propose several tokens in one go
    • non-autoregressive models, which emit multiple tokens per step, also qualify;
    • so do compressed (quantized / distilled) small models;
    • it plugs in as an add-on, with no change to the target LLM.
  • Suppose that, given inputs, the draft proposes the next two tokens A, B
    • The target model (the LLM) can then run three inferences in parallel and effectively produce three tokens in one round (A + B + C):
      • target_model(inputs) => A
      • target_model(inputs + A) => B
      • target_model(inputs + A + B) => C
  • If the draft is partly wrong, e.g. inputs => A, B (A is right, B is wrong):
    • target_model(inputs) => A
    • target_model(inputs + A) => C (the error is detected here)
    • target_model(inputs + A + B) => D (clearly unreliable, discarded)
    • Even so, this round still yields two tokens (A and C).
  • In the worst case, both proposed tokens (A and B) are wrong:
    • target_model(inputs) => C
    • target_model(inputs + A) => D
    • target_model(inputs + A + B) => E
    • Even then we still obtain one correct token (C), so decoding never falls behind ordinary autoregressive generation.
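
Below is a minimal greedy-verification sketch of that accept/verify loop. It assumes two Hugging Face causal LMs sharing a tokenizer, a small draft and a large target (both names are placeholders), and it verifies by greedy agreement rather than the rejection-sampling rule used in the original papers, so treat it as an illustration of the control flow only.

import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    """Propose k tokens with the cheap draft model, then verify them with one target pass."""
    # 1. Draft model proposes k tokens autoregressively (cheap, small model).
    proposal = input_ids
    for _ in range(k):
        logits = draft(proposal).logits[:, -1, :]
        proposal = torch.cat([proposal, logits.argmax(-1, keepdim=True)], dim=-1)
    drafted = proposal[:, input_ids.shape[1]:]                         # (1, k) proposed tokens

    # 2. A single target forward pass scores every drafted position at once.
    tgt_logits = target(proposal).logits                               # (1, prompt_len + k, vocab)
    tgt_pred = tgt_logits[:, input_ids.shape[1] - 1:-1, :].argmax(-1)  # target's choice at each drafted slot

    # 3. Accept the longest prefix on which draft and target agree, plus one corrected token.
    agree = (tgt_pred == drafted)[0]
    n_accept = int(agree.cumprod(0).sum())                             # length of the agreeing prefix
    accepted = drafted[:, :n_accept]
    correction = tgt_logits[:, input_ids.shape[1] - 1 + n_accept, :].argmax(-1, keepdim=True)
    return torch.cat([input_ids, accepted, correction], dim=-1)

Each call appends between 1 and k+1 tokens while invoking the expensive target model only once, which is where the speed-up comes from.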

Some notes on multi-task learning:

The approach to fine-tuning a large language model (LLM) for multiple tasks depends on various factors, including the size of your datasets, the similarity of the tasks, and your available computational resources. Here are two common approaches:

  1. Multi-Task Training (Combined Datasets):
  • If you have multiple datasets for different tasks and these tasks share some similarities (e.g., all text classification tasks), you can combine them into a single dataset.
  • Multi-task training on a combined dataset can lead to a model that generalizes well across different tasks. It allows the model to learn shared representations and potentially perform better on each task.
  • However, combining datasets may introduce some noise or task-specific patterns that could negatively impact performance on individual tasks.
  2. Task-Specific Fine-Tuning (Sequential Training):
  • Alternatively, you can fine-tune your LLM separately for each task. Train the model on one task, save the weights (e.g., LoRA weights), and then fine-tune the model for the next task using the base weights combined with the previously saved LoRA weights.
  • This approach can be useful when tasks are significantly different or when you have limited computational resources. It allows you to fine-tune incrementally and retain task-specific knowledge.
  • However, it may require more manual intervention to manage the training process for each task.

Consider these factors when deciding which approach to take:

  • Data Size: If you have a large amount of data for each task, multi-task training on combined datasets can be effective. If data is limited, task-specific fine-tuning may be better.

  • Task Similarity: If tasks are closely related, multi-task training can benefit from shared representations. If tasks are dissimilar, task-specific fine-tuning might be more appropriate.

  • Computational Resources: Multi-task training can be computationally intensive, so consider your hardware limitations.

  • Evaluation Metrics: Evaluate both approaches on your specific tasks using appropriate evaluation metrics to determine which works better in practice.

  • Experiment: It’s often beneficial to experiment with both approaches to see which one yields better results for your specific use case.

The choice between these approaches can vary based on your specific requirements and constraints.
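
As a concrete illustration of the second (sequential, task-specific) strategy, here is a minimal sketch with peft/LoRA. The base model name, output directories, and the datasets task_a_ds / task_b_ds (assumed to be already tokenized with input_ids and labels) are all placeholders, not recommendations.

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model, PeftModel

base_name = "facebook/opt-350m"  # placeholder base model

# Task A: train a LoRA adapter on the first dataset, then save only the adapter weights.
model = get_peft_model(AutoModelForCausalLM.from_pretrained(base_name),
                       LoraConfig(r=16, lora_alpha=16, task_type="CAUSAL_LM"))
Trainer(model=model, args=TrainingArguments("out_task_a", num_train_epochs=1),
        train_dataset=task_a_ds).train()
model.save_pretrained("lora_task_a")

# Task B: start from the base weights plus the task-A adapter, then continue fine-tuning.
base = AutoModelForCausalLM.from_pretrained(base_name)
model = PeftModel.from_pretrained(base, "lora_task_a", is_trainable=True)
Trainer(model=model, args=TrainingArguments("out_task_b", num_train_epochs=1),
        train_dataset=task_b_ds).train()
model.save_pretrained("lora_task_a_then_b")

The point of PeftModel.from_pretrained(..., is_trainable=True) is that task B starts from the base weights combined with the task-A adapter, so earlier task knowledge is retained while only the small adapter keeps being updated.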

A simple little trick: when subclassing, if you want a method in the child class to call the parent's method of the same name, you can write it like this:

class A:
	def __init__(self):
		pass
		
	def f(self, a):
		return a + 1 # res = a + 1

class B(A):
	def __init__(self):
		super(B, self).__init__()
	
	def f(self, a):
		return super(B, self).f(a) ** 2 # res = (a + 1) ^ 2
	

Calling self.f(a) directly would of course recurse into itself and overflow the stack. The typical use case is overriding a parent method where the override is only a small tweak on top of the parent's logic and you don't want to copy the whole body.


20240913

  • Weighed in before dinner: 65.5 kg with clothes on, so roughly 65 kg (130 jin) net. I really haven't gained much; whatever.

  • Found out the guy behind me is also a total madman; things really are going from strength to strength.

  • Today's form was clearly much better than yesterday's. Evening interval session: 2.56K@3'42" + 400m@1'22" + 400m@1'14" + 300m@57s + 300m@56s + 2.43K@3'53" + 400m@1'22" + 2.43K@3'51" + 1.24K@3'53", about 10K in total. From the 2.43K reps onward I ran with Jiawei, which kept the quality of the last four reps up. A really solid session, mainly to squeeze in some volume before leaving. The plan is strength training tomorrow night, an early LSD the day after, a nap at noon, then pack up and set off.

  • PS: I've noticed a serious runner appearing on the track lately. According to their Keep log, after 9 pm last night (right after our practice, so I didn't see them) they ran 4 × 400m intervals at 1'11"–1'13" per lap, followed by 8 easy laps at 4'50"/km. I've also often seen their 10K logs, mostly around 45 minutes, which is a very high level; the 10Ks may just be easy runs, but 400m reps of that quality are already my ceiling. The odd thing is the profile lists their gender as female. I still can't tell who it is, and I doubt it's really a woman; if it is, a woman at what looks like close to national level-1 standard would simply be playing on another plane.

https://docs.adapterhub.ml/transitioning.html

AdapterHub has released an update to adapter-transformers:

Switch to the adapters package. Previously, installing adapter-transformers pulled in its own bundled copy of transformers; the two are now split apart.

adapters is compatible with all existing adapters from adapter-transformers.

# pip install adapter-transformers
pip install adapters

Installation automatically upgrades transformers and related packages to compatible versions, which makes the old adapter-transformers code error out:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
adapter-transformers 3.2.1.post0 requires huggingface-hub<0.14.0,>=0.11.0, but you have huggingface-hub 0.24.7 which is incompatible.
adapter-transformers 3.2.1.post0 requires tokenizers!=0.11.3,<0.14,>=0.11.1, but you have tokenizers 0.19.1 which is incompatible.

Some class migrations:

  • AdapterModel classes, e.g. AutoAdapterModel (see AdapterModels )

  • Adapter configurations e.g. PrefixTuningConfig (see Configurations )

  • Adapter composition blocks, e.g. Stack (see Composition Blocks)

  • The AdapterTrainer class

adapter-transformers example:

from transformers import AutoModelWithHeads

model = AutoModelWithHeads.from_pretrained(r"D:\resource\model\huggingface\common\roberta-base")
adapter_name = model.load_adapter(r"D:\resource\model\huggingface\AdapterHub\roberta-base-pf-hotpotqa", source="hf")
model.active_adapters = adapter_name

from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(r"D:\resource\model\huggingface\common\roberta-base")
question_answering_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

articles = "Ben's father is David, they are good friends."
questions = "Who is Ben's father?"
inputs = {"question": questions, "context": articles}
outputs = question_answering_pipeline(inputs)
# {'score': 1.5180694390437566e-05, 'start': 16, 'end': 22, 'answer': 'David,'}

from transformers import pipeline, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(r"D:\resource\model\huggingface\common\roberta-base")

articles = "Radio City is India's first private FM radio station and was started on 3 July 2001."
questions = "What is the first private FM radio station in India?"

articles = """Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy"."""
questions = """When did Beyonce start becoming popular?"""
inputs = tokenizer(questions,
                   articles,
                   max_length=256,
                   padding='max_length',
                   truncation=True,
                   return_tensors='pt',
                   )  # (4, max_length)
outputs = model(**inputs)
vocab = tokenizer.get_vocab()
vocab = {id_: token for token, id_ in vocab.items()}
start_index = outputs.start_logits[0, 1:].argmax().item()
end_index = outputs.end_logits[0, 1:].argmax().item()
print(start_index, end_index)
print(' '.join([vocab[inputs["input_ids"][0, 1:][i].item()] for i in range(start_index, end_index + 1)]))
for iid in inputs["input_ids"][0]:
    print(vocab[iid.item()], end=' ')

adapters example:

from transformers import AutoModel
import adapters

model = AutoModel.from_pretrained("bert-base-uncased")
adapters.init(model) # prepare model for use with adapters

Calling adapters.init() is a required step here (when starting from a plain transformers model).

from adapters import AdapterConfig

adapter_config = AdapterConfig.load("lora")

# add adapter to the model
model.add_adapter("adapter_name", config="lora")
# activate adapter
model.set_active_adapters("adapter_name")
# freeze model weights and activate adapter
model.train_adapter("adapter_name")

So the updated way to write it is:

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
import adapters

model = AutoModelForQuestionAnswering.from_pretrained(model_path)
adapters.init(model) # prepare model for use with adapters
model.add_adapter(adapter_path, config="lora")
model.set_active_adapters(adapter_path)

tokenizer = AutoTokenizer.from_pretrained(model_path)
question_answering_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

articles = "Ben's father is David, they are good friends."
questions = "Who is Ben's father?"
inputs = {"question": questions, "context": articles}
outputs = question_answering_pipeline(inputs)
outputs

By setting the set_active parameter to True directly, you can save one line:

model.add_adapter(adapter_path, config="lora", set_active=True)
# model.set_active_adapters(adapter_path)

But interestingly, with this version:

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
import adapters

model = AutoModelForQuestionAnswering.from_pretrained(model_path)
adapters.init(model) # prepare model for use with adapters
model.add_adapter(adapter_path, config="lora", set_active=True)
model.set_active_adapters(adapter_path)

tokenizer = AutoTokenizer.from_pretrained(model_path)
question_answering_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

articles = "Ben's father is David, they are good friends."
questions = "Who is Ben's father?"
inputs = {"question": questions, "context": articles}
outputs = question_answering_pipeline(inputs)
outputs

this gives different results on every run. It took quite a while to realize that the first argument of model.add_adapter is not an adapter path at all but simply the adapter_name, so the code above never actually loads the downloaded adapter.

from adapters import BnConfig, ConfigUnion

config = ConfigUnion(
    BnConfig(mh_adapter=True, output_adapter=False, reduction_factor=16, non_linearity="relu"),
    BnConfig(mh_adapter=False, output_adapter=True, reduction_factor=2, non_linearity="relu"),
)
model.add_adapter("union_adapter", config=config)

After several failed attempts, the correct way to call it in the new version turns out to be:

model_path = r"D:\resource\model\huggingface\common\roberta-base"
adapter_path = r"D:\resource\model\huggingface\AdapterHub\roberta-base-pf-hotpotqa"

from adapters import AutoAdapterModel
import adapters

model = AutoAdapterModel.from_pretrained(model_path)
adapters.init(model) # prepare model for use with adapters
adapter_name = model.load_adapter(adapter_path)
model.set_active_adapters(adapter_name)

from transformers import AutoTokenizer, pipeline
tokenizer = AutoTokenizer.from_pretrained(model_path)
question_answering_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

articles = "Ben's father is David, they are good friends."
questions = "Who is Ben's father?"
inputs = {"question": questions, "context": articles}
outputs = question_answering_pipeline(inputs)
outputs

This works correctly, and it turns out that adapters.init(model) makes no difference here either way.


20240914

  • Tomorrow evening's flight ✈; with luck I can slip away before Bebinca makes landfall. Packing is more or less done; in the end I still threw in the pair of cushioned shoes NRC handed out, in case I feel like a short run, and this may also be one of the few chances in my life to treat some white folks to a spicy burger [smirk].

  • Strength training in the evening: 8 sets of 30 walking lunges (+20 kg), with 50 double-leg calf raises (+20 kg) between sets, then a 40-minute easy run at 4'45"/km to wind down. Why run that much? Because as of last night my monthly mileage stood at 72.1K (average pace 4'02"), and I'm still clinging to the dream of hitting 100K by mid-month. As for the remaining 100K? That's a problem for the second half of the month to figure out.

  • PS: Finishing the remaining 9.5K tomorrow morning feels iffy; the load has been high these past few days and I'm a bit tired. No point forcing it.

import re
s = 'abc,  abc,  defg,  dds'
re.split(r'\W+', s)  # note: \W matches any non-word character
# result:
['abc', 'abc', 'defg', 'dds']
re.split(r'(\W+)', s)  # note: wrapping the pattern in a capturing group also returns the separators
# result:
['abc', ',  ', 'abc', ',  ', 'defg', ',  ', 'dds']
re.split(r'(\W+)', s, 1)  # note: split the string only once
# result:
['abc', ',  ', 'abc,  defg,  dds']
re.split(r'wxy*', s)  # note: nothing matches, so the original string is returned unchanged
# result:
['abc,  abc,  defg,  dds']
line = 'aaa bbb ccc;ddd   eee,fff'
re.split(r'[;,]', line)  # to split on more than one character, put them in a character class [ ]
# result:
['aaa bbb ccc', 'ddd   eee', 'fff']
re.split(r'[;,\s]', line)  # also split on any whitespace character
# result:
['aaa', 'bbb', 'ccc', 'ddd', '', '', 'eee', 'fff']
file_name = 'F:\\02-data\\data_standar\\0224整年-Exported.csv'
print(re.split(r'[\\, .]', file_name))
['F:', '02-data', 'data_standar', '0224整年-Exported', 'csv']

Following up on yesterday, I later tested the actual outputs of the ChatGLM series; take ChatGLM-6B-INT4 as an example:


import os
import gc
import logging

def test_2():
	from transformers import AutoTokenizer, AutoModel
	model_root = '/nfsshare/home/caoyang/code/model/huggingface/THUDM'
	model_names = ['chatglm-6b-int4', 'chatglm-6b-int8', 'chatglm-6b']
	
	for model_name in model_names:
		logging.info(f"Load model: {model_name}")
		model_path = os.path.join(model_root, model_name)
		tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
		model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
		model = model.eval()
		logging.info(f"- ok!")
		context = [
		"""关于辩论原则的表述,下列哪些选项是正确的?
A. 当事人辩论权的行使仅局限于一审程序中开庭审理的法庭调查和法庭辩论阶段
B. 当事人向法院提出起诉状和答辩状是其行使辩论权的一种表现
C. 证人出庭陈述证言是证人行使辩论权的一种表现
D. 督促程序不适用辩论原则""",
	"""依我国法律规定,在我国法院受理的涉外离婚案件审理过程中,认定婚姻是否有效应当以下列哪一项为准据法?
A.婚姻缔结地法
B.当事人本国法
C.当事人住所地法
D.法院地""",
		]
		logging.info(f"Run chat ...")
		response, history = model.chat(tokenizer, context[0], history=[])
		with open(f"run-{model_name}.txt", 'w', encoding="utf8") as f:
			f.write(response)
			f.write('\n' + '------------' + '\n')
			f.write(str(history))
		logging.info(f"- ok!")
		logging.info(f"Run forward ...")
		model_inputs = tokenizer(context[0], 
								 max_length=256, 
								 padding="max_length", 
								 truncation=True, 
								 return_tensors="pt",
								 )
		for key in model_inputs:
			model_inputs[key] = model_inputs[key].cuda()
		model_outputs = model(**model_inputs)
		logits = model_outputs.logits
		past_key_values = model_outputs.past_key_values
		with open(f"run-{model_name}.txt", 'a', encoding="utf8") as f:
			f.write('\n' + '------------' + '\n')
			f.write(str(model_inputs))
			f.write('\n')
			for key in model_inputs:
				f.write(f"{key}: {model_inputs[key].size()}\n")
			f.write('\n' + '------------' + '\n')
			
			f.write(str(dir(model_outputs)))
			f.write('\n' + '------------' + '\n')
			f.write(f"logits: {logits.size()}\n")
			f.write(f"past_key_values: {len(past_key_values)}\n")
			for i, past_key_value in enumerate(past_key_values):
				f.write(f"{i}. {len(past_key_value)} ")
				for kv in past_key_value:
					f.write(f"{kv.size()} ")
				f.write('\n')
			
		logging.info(f"- ok!")
		del model, tokenizer, model_inputs, model_outputs, logits, past_key_values
		gc.collect()

In the tokenizer output model_inputs, besides input_ids and attention_mask there is an extra position_ids; all three have shape (batch_size, max_length).

model_outputs contains three fields: loss, logits, past_key_values.

  • loss is the loss value; it is None here, since no labels were passed in.
  • The shape of logits is (batch_size, max_length, d_model),
  • where d_model is 130528 (i.e. the vocab_size).
  • For ChatGLM-6B-int4/int8, past_key_values is a List(Tuple(key: tensor, value: tensor));
  • key and value have the same shape, (max_length, batch_size, 32, 128).
  • The length of the past_key_values list is 28, matching num_layers in the config.
  • The 32 and 128 in the key/value shapes indeed seem fixed; checking the config, 32 is num_attention_heads, and 128 is presumably the per-head dimension (hidden_size divided by num_attention_heads).
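
That past_key_values structure is exactly the KV cache used for incremental decoding: pass it back in and the next forward pass only needs the newly generated token. Below is a small sketch written against the generic Hugging Face causal-LM interface; it is illustrative only, since ChatGLM's custom modeling code may expect slightly different arguments (e.g. explicit position_ids).

import torch

@torch.no_grad()
def greedy_decode_with_cache(model, tokenizer, prompt, max_new_tokens=32):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model(**inputs, use_cache=True)
    past = out.past_key_values                      # cache for all prompt tokens
    next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
    generated = [next_id]

    for _ in range(max_new_tokens - 1):
        # Only the single new token is fed in; attention reuses the cached keys/values.
        out = model(input_ids=next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1, :].argmax(-1, keepdim=True)
        generated.append(next_id)

    return tokenizer.decode(torch.cat(generated, dim=-1)[0], skip_special_tokens=True)

The prompt is encoded once; afterwards each step feeds a single token plus the cache, which is why per-token generation cost stays roughly constant.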

20240915 (final entry)

This morning I had planned to go to the Yipao 10K and pace a few friends who were racing; fortunately I couldn't get up, as the post-race reviews were terrible. A bit after nine I finished the planned run and rounded my monthly mileage up to 100K. Very tiring, muggy and sun-baked, nothing like a city about to be hit by a typhoon.

After a big lunch I went back to shower and rest a little, then set off at one o'clock sharp. Even though I had made a checklist of all my luggage beforehand, leaving the house was still a scramble.

At check-in I was told my visa was a tourist visa and that I needed an itinerary, which was strange: I definitely applied for a cultural/sports visa, and that genuinely does not require an itinerary. No idea what happened there. Luckily I had some A4 paper with me and wrote up an itinerary on the spot; the invitation letter and ticket were already printed, so I shouldn't get stopped at immigration.

Fortunately the flight wasn't delayed; every Shanghai departure after 8 pm tonight was cancelled, so I was lucky to escape just in time.

As I write this I am already at Suvarnabhumi Airport in Bangkok, with nearly three hours of layover ahead and then a long twelve-hour flight. If all goes well I land in Zurich at 7:25 local time; I have already bought the train ticket to Lugano (12:48–14:58), and Vegan will meet me at the station. Hoping everything goes smoothly.

First time on an international flight; the cabin really is a notch above domestic budget airlines. The guy next to me was heading to the UK to study, and we looked for the way together after landing. We agonized over whether we needed to collect our luggage, then somehow got funneled straight into security (no check-in needed at all), where my two forgotten bottles of milk were promptly confiscated. In the end we finally confirmed that there was no need to collect luggage (transferring on the same airline within the same airport, bags are checked through).

All in all, everything is fine. It will be exhausting, but it will also be interesting. Much like the crazy training hikes in Xinzhou these past couple of days: hiking really does suit me better.

At 7:45 am local time on September 16 (13:45 Beijing time, exactly 24 hours after setting out) I landed safely in Zurich, after roughly eleven and a half hours in the air. A small episode: two of the three seats in the row ahead were empty, and the old gentleman to my right suggested I move up there for more room. With a blonde woman sitting to my left I was indeed a little uncomfortable, so I took his suggestion; the guy in the row ahead on my right then joked that this required an extra charge, but that it didn't matter as long as nobody noticed. I genuinely thought he was joking, and a crew member immediately called me on it. For a moment I felt the old gentleman had played me, until he calmly pulled out a gold card and told the crew he wanted to buy the seat in front; once he had moved up, the crew said I was now free to switch into his old seat (so apparently swapping within the same row is free, while changing rows costs extra).

I ate two meals on the plane and was stuffed, since I had also eaten before boarding, and I never used the toilet the whole flight, though I didn't really feel the need to. After landing, customs didn't check luggage at all; as long as you have nothing to declare you just walk through. I don't quite get it. Aren't meat and dairy products supposed to be banned from entry? Beats me. Zurich is cold, only 5°C, and I only brought one long-sleeve shirt, with more than four hours still to go before my booked train.

And here this comes to a close. As a final entry it feels a little hasty, but so be it. That is what life is like, and what it ought to be.
