Overview & the Original Model

Preface

Paper code repository: https://github.com/softhuafei/Pytorch-implementation-for-OCSGA

The repo is actually missing the object_label file; you have to dig through the commit history to download it. I have no idea why it was deleted.

This should be the first time I've modified the code of a paper. I'm not very clear about the overall code framework, nor particularly familiar with the specific methods it uses, so I'm learning as I go and keeping a record here.

----2022/5/30 update----

After discussing with my PhD senior, it feels like there is still a long way to go. There really are quite a few bugs, and a lot of things to sort out. Since this is not something that can be solved just by looking things up, I need to calm down and study the code carefully. The code is actually quite well organized and there is a lot worth learning from it, so I have to hold back the restlessness and put up with the loneliness 🚴‍♂️.

----2022/6/5 update----

The overfitting problem still isn't solved, which is rather disheartening, but I still have to finish this properly. I think the low mood comes from setting expectations too high and not analyzing the task objectively enough, so all the more reason to calm down and re-analyze. Although the defense is only days away, the concrete results will probably have to wait until the thesis is submitted. Keep going!

----2022/6/18 update----

Since the course's deadline was extended, zxr finally got a lifeline, so let me record some thoughts here.

Right now I want to improve the model, but I don't know where to start.

According to what Prof. Hu said, you should model according to the objective and design according to the characteristics of the problem. This probably involves extracting features from the data: after observing the data, you form a hypothesis about some distribution, or about some internal relation, and apply that insight to the construction of the model. The problem is that the result may be good, or it may not work at all. Moreover, such an insight comes only from a partial, after-the-fact observation, which may still be quite far from the underlying essence. Even if it works, how do you prove that it works for the reason you designed it? With a complex model like a deep neural network, you cannot constrain it entirely with hand-crafted features, yet you still want a certain degree of control. Pursuing that balance is subtle, and also interesting.

Could the optimization objective of a neural network be framed as keeping the loss function from getting stuck in local extrema, and could interpretability be understood as designing a mechanism that transforms the loss landscape into one that is easier for gradient descent? One route is to extract enough features. Take an MLP: in principle you can feed it different kinds of data and it can approximate the distribution, but because of the dataset it may learn for a while and then stall, which is why in practice the fitting power of MLP + Softmax is often limited. CNNs and Transformers, on the other hand, first perform a feature-extraction step; this is a weak mechanism that transforms the input into the parts most relevant to the problem, so the specific problem can then be tackled.

The source of the uncontrollability is probably gradient descent itself: it only prescribes that the gap between distributions should shrink, not how that is achieved. It can explore regularities humans have never touched, but the internal hidden states and mechanisms become somewhat uninterpretable. Maybe the feature-extraction step can be viewed as broadening the criterion beyond purely descending toward the target, to also taking care of extracting information, i.e. amplifying the genuinely useful signal.

Three levels of understanding neural networks:

  1. Iterated compositions of linear and nonlinear maps, i.e. the data-space-transformation view; this perspective is quite limited and not generally applicable.
  2. Approximation of probability distributions and functions, which explains the potential fitting capacity of neural networks from a theoretical angle.
  3. Optimizing the loss-function landscape.

What is the connection between a good idea and an idea that works?

Personally, I feel that people always observe some internal relation or causality in the data, derive some regularity from it, and then use that regularity as the motivation for improving the model. In the loss-function space this may amount to selecting a potentially better region and discarding the unlikely parts. Because of the complexity of datasets and models, there is no way to prove an idea correct in advance; it can only be shown by results. Generally speaking, a good idea is good for a specific problem (setting universality aside for now): for a specific problem, you are effectively imposing a weak constraint with some rule. It feels a bit like changing something inside a black box without knowing whether it will work. In theory an MLP could learn such regularities, but given the complexity of real problems it rarely gets there, and this is where human ideas can help the model learn better. So the underlying assumption is still that the human brain can understand the problem correctly.

So it does feel a bit like alchemy; it isn't something that can be proved theoretically, but you can gradually get a feel for it through practice. Physics starts from assumptions, proves things, and then verifies; deep learning starts from assumptions, experiments, and only then tries to understand. The reason something ultimately works may be manifold, with many coupled factors and very little theoretical proof.

Starting from main.py

Let's read main.py starting from if __name__ == '__main__':

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Tuning with NCRF++')
    # parser.add_argument('--status', choices=['train', 'decode'], help='update algorithm', default='train')
    parser.add_argument('--config', help='Configuration File', default='None')
    parser.add_argument('--wordemb', help='Embedding for words', default='None')
    parser.add_argument('--charemb', help='Embedding for chars', default='None')
    parser.add_argument('--status', choices=['train', 'decode'], help='update algorithm', default='train')
    parser.add_argument('--savemodel', default="data/model/saved_model.lstmcrf.")
    parser.add_argument('--savedset', help='Dir of saved data setting')
    parser.add_argument('--train', default="data/conll03/train.bmes")
    parser.add_argument('--dev', default="data/conll03/dev.bmes")
    parser.add_argument('--test', default="data/conll03/test.bmes")
    parser.add_argument('--seg', default="True")
    parser.add_argument('--raw')
    parser.add_argument('--loadmodel')
    parser.add_argument('--output')
    args = parser.parse_args()

    data = Data()
    data.HP_gpu = torch.cuda.is_available()
    if args.config == 'None':
        data.train_dir = args.train
        data.dev_dir = args.dev
        data.test_dir = args.test
        data.model_dir = args.savemodel
        data.dset_dir = args.savedset
        print("Save dset directory:", data.dset_dir)
        save_model_dir = args.savemodel
        data.word_emb_dir = args.wordemb
        data.char_emb_dir = args.charemb
        if args.seg.lower() == 'true':
            data.seg = True
        else:
            data.seg = False
        print("Seed num:", seed_num)
    else:
        data.read_config(args.config)
    # data.show_data_summary()
    status = data.status.lower()
    print("Seed num:", seed_num)
    #
    if (data.root_dir is not None) and (not os.path.exists(data.root_dir)):
        os.mkdir(data.root_dir)

    if status == 'train':
        print("MODEL: train")
        data_initialization(data)
        data.generate_instance('train')
        data.generate_instance('dev')
        data.generate_instance('test')
        data.build_pretrain_emb()
        train(data)
    elif status == 'decode':
        print("MODEL: decode")
        data.load(data.dset_dir)
        data.read_config(args.config)
        print(data.raw_dir)
        # exit(0)
        data.show_data_summary()
        data.generate_instance('raw')
        print("nbest: %s"%(data.nbest))
        decode_results, pred_scores = load_model_decode(data, 'raw')
        if data.nbest and not data.sentence_classification:
            data.write_nbest_decoded_results(decode_results, pred_scores, 'raw')
        else:
            data.write_decoded_results(decode_results, 'raw')
    else:
        print("Invalid argument! Please use valid arguments! (train/test/decode)")

First there is parser = argparse.ArgumentParser(description='Tuning with NCRF++'). You can think of it as a container holding parameters; its distinguishing feature is that the parameters are assigned from the command line.

# demo.py
import argparse

parser = argparse.ArgumentParser(description='pass a value in from the command line')
# type is the expected data type of the argument; help is its description
parser.add_argument('--str_arg', type=str, help='the string to pass in')

args = parser.parse_args()

# read the parsed argument
print(args.str_arg)
python demo.py --str_arg lalala
> lalala

Recommended references:
https://zhuanlan.zhihu.com/p/56922793
https://www.bilibili.com/video/BV1U4411j7xb?spm_id_from=333.337.search-card.all.click

Next we run into data, which is an instance of the Data class defined in data.py. The Data class is in charge of the model's parameters and the data. Here is some of what it contains (only a part):

self.model_name = None
### data
self.sentence_classification = False
self.MAX_SENTENCE_LENGTH = 250
self.MAX_WORD_LENGTH = -1
self.MAX_OBJECT_NB = 3
self.number_normalized = True
self.norm_word_emb = False
self.word_alphabet = Alphabet('word')

self.label_alphabet = Alphabet('label',True)
self.tagScheme = "NoSeg" ## BMES/BIO
self.split_token = ' ||| '
self.seg = True

### I/O
self.train_dir = None
self.dev_dir = None
self.test_dir = None
self.raw_dir = None

self.root_dir = None
self.decode_dir = None
self.dset_dir = None ## data vocabulary related file
self.model_dir = None ## model save file
self.load_model_dir = None ## model load file

self.log_dir = None

self.feature_emb_dirs = []
# img object files
self.object_dir = None

self.train_texts = []
self.dev_texts = []
self.test_texts = []
self.raw_texts = []

self.train_Ids = []
self.dev_Ids = []
self.test_Ids = []
self.raw_Ids = []

self.word_emb_dim = 50
self.char_emb_dim = 30
self.object_emb_dim = 50

###Networks
self.word_feature_extractor = "LSTM" ## "LSTM"/"CNN"/"GRU"/
self.use_char = True
self.char_feature_extractor = "CNN" ## "LSTM"/"CNN"/"GRU"/None
self.use_crf = True
self.nbest = None

## Training
self.average_batch_loss = False
self.optimizer = "SGD" ## "SGD"/"AdaGrad"/"AdaDelta"/"RMSProp"/"Adam"
self.status = "train"
### Hyperparameters
self.HP_cnn_layer = 4
self.HP_iteration = 100
self.HP_batch_size = 10
self.HP_char_hidden_dim = 50
self.HP_hidden_dim = 512
self.HP_dropout = 0.5
self.HP_MCA_dropout = 0.5
self.HP_multi_head = 1
self.HP_SGA_layer=8

self.HP_bilstm = True

self.HP_gpu = False
self.HP_lr = 0.015
self.HP_lr_decay = 0.05
self.HP_clip = None
self.HP_momentum = 0
self.HP_l2 = 1e-8

Next it checks whether a config file was given. If not, the default values in the args container are assigned to data's member variables, e.g. data.train_dir. If config is not empty, it calls data.read_config(config), which reads the parameters from the config file into the corresponding fields of data; concretely, it first converts the config file's contents into a dict using the config_file_to_dict function in data.py.
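As a rough illustration (my own sketch of the idea, not the repo's exact implementation), a config parser of this kind boils down to splitting item=value lines into a dict and then copying recognized keys onto the Data object:

# Minimal sketch of reading a "key=value" config file into a dict.
# This mirrors the idea of config_file_to_dict / read_config, not the exact code.
def config_file_to_dict(path):
    config = {}
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()   # drop comments and whitespace
            if '=' not in line:
                continue
            key, value = line.split('=', 1)
            config[key.strip()] = value.strip()
    return config

# e.g. config = config_file_to_dict('sample.train.config')
#      data.train_dir = config.get('train_dir', data.train_dir)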

Then comes the training / decoding part. Taking train as an example,

    if status == 'train':
        print("MODEL: train")
        data_initialization(data)  # build the alphabets (vocabularies)
        data.generate_instance('train')
        data.generate_instance('dev')
        data.generate_instance('test')
        data.build_pretrain_emb()
        train(data)

# data_initialization
def data_initialization(data):
    data.initial_feature_alphabets()
    data.build_alphabet(data.train_dir)
    data.build_alphabet(data.dev_dir)
    data.build_alphabet(data.test_dir)
    data.build_object_alphabet(data.object_dir)
    data.fix_alphabet()

The details of building the alphabets will be covered in a later chapter. Once the alphabets are built, the generate_instance function is called; it is related to batch splitting and to writing decode results to file. Here is an example of what it produces; essentially, all the information is pulled together.

self.train_texts[0]

[words, features, chars, labels, objects]

[['RT', '@JayKenMinaj', '_', ':', 'Me', 'outside', 'of', 'where', 'George', 'Zimmerman', 'got', 'shot', 'at', '.', 'You', 'know', 'God', 'is', 'so', 'good', '.', 'http://t.co/Z3neVBQ7vF'], [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []], [['R', 'T'], ['@', 'J', 'a', 'y', 'K', 'e', 'n', 'M', 'i', 'n', 'a', 'j'], ['_'], [':'], ['M', 'e'], ['o', 'u', 't', 's', 'i', 'd', 'e'], ['o', 'f'], ['w', 'h', 'e', 'r', 'e'], ['G', 'e', 'o', 'r', 'g', 'e'], ['Z', 'i', 'm', 'm', 'e', 'r', 'm', 'a', 'n'], ['g', 'o', 't'], ['s', 'h', 'o', 't'], ['a', 't'], ['.'], ['Y', 'o', 'u'], ['k', 'n', 'o', 'w'], ['G', 'o', 'd'], ['i', 's'], ['s', 'o'], ['g', 'o', 'o', 'd'], ['.'], ['h', 't', 't', 'p', ':', '/', '/', 't', '.', 'c', 'o', '/', 'Z', '0', 'n', 'e', 'V', 'B', 'Q', '0', 'v', 'F']], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['person', 'car', 'cell', 'phone', 'car']]

self.train_Ids[0]
[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 15, 22], [[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []], [[2, 3], [4, 5, 6, 7, 8, 9, 10, 11, 12, 10, 6, 13], [14], [15], [11, 9], [16, 17, 18, 19, 12, 20, 9], [16, 21], [22, 23, 9, 24, 9], [25, 9, 16, 24, 26, 9], [27, 12, 28, 28, 9, 24, 28, 6, 10], [26, 16, 18], [19, 23, 16, 18], [6, 18], [29], [30, 16, 17], [31, 10, 16, 22], [25, 16, 20], [12, 19], [19, 16], [26, 16, 16, 20], [29], [23, 18, 18, 32, 15, 33, 33, 18, 29, 34, 16, 33, 27, 35, 10, 9, 36, 37, 38, 35, 39, 40]], [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [5, 16, 18, 19, 16]]

With preprocessing done, we go straight into the training function train.

The first thing to handle is splitting the batches. Concretely, compute the batch start and end indices and slice train_Ids:

batch_size = data.HP_batch_size
batch_id = 0
train_num = len(data.train_Ids)
total_batch = train_num//batch_size+1
for batch_id in range(total_batch):
    start = batch_id*batch_size
    end = (batch_id+1)*batch_size
    if end > train_num:
        end = train_num
    instance = data.train_Ids[start:end]
    if not instance:
        continue
    batch_word, batch_features, batch_wordlen, batch_wordrecover, batch_char, batch_charlen, batch_charrecover, batch_label, mask, batch_obj, obj_mask = batchify_with_label(instance, data.HP_gpu, True, data.sentence_classification, data.MAX_OBJECT_NB)

Now look at the batchify_with_label function:

def batchify_with_label(input_batch_list, gpu, if_train=True, sentence_classification=False, max_object_nb=0):
    if sentence_classification:
        return batchify_sentence_classification_with_label(input_batch_list, gpu, if_train)
    else:
        return batchify_sequence_labeling_with_label(input_batch_list, gpu, if_train, max_object_nb=max_object_nb)

which in turn means looking at batchify_sequence_labeling_with_label (an excerpt):

def batchify_sequence_labeling_with_label(input_batch_list, gpu, if_train=True, max_object_nb=0):
    """
    input: list of words, chars and labels, various length. [[words, features, chars, labels, topics],[words, features, chars,labels, topics],...]
        words: word ids for one sentence. (batch_size, sent_len)
        features: features ids for one sentence. (batch_size, sent_len, feature_num)
        chars: char ids for on sentences, various length. (batch_size, sent_len, each_word_length)
        labels: label ids for one sentence. (batch_size, sent_len)
        object: object id for this sentence

    output:
        zero padding for word and char, with their batch length
        word_seq_tensor: (batch_size, max_sent_len) Variable
        feature_seq_tensors: [(batch_size, max_sent_len),...] list of Variable
        word_seq_lengths: (batch_size,1) Tensor
        char_seq_tensor: (batch_size*max_sent_len, max_word_len) Variable
        char_seq_lengths: (batch_size*max_sent_len,1) Tensor
        char_seq_recover: (batch_size*max_sent_len,1) recover char sequence order
        label_seq_tensor: (batch_size, max_sent_len)
        mask: (batch_size, max_sent_len)
        object_tensor: (batch_size, max_topic_len) Variable
        object_mask: (batch_size, max_topic_len)
    """
    batch_size = len(input_batch_list)
    words = [sent[0] for sent in input_batch_list]  # compare with the train_Ids output shown above
    features = [np.asarray(sent[1]) for sent in input_batch_list]
    feature_num = len(features[0][0])
    chars = [sent[2] for sent in input_batch_list]
    labels = [sent[3] for sent in input_batch_list]
    objects = [sent[4] for sent in input_batch_list]

    word_seq_lengths = torch.LongTensor(list(map(len, words)))
    max_seq_len = word_seq_lengths.max().item()
    word_seq_tensor = torch.zeros((batch_size, max_seq_len), requires_grad = if_train).long()
    label_seq_tensor = torch.zeros((batch_size, max_seq_len), requires_grad = if_train).long()
    object_tensor = torch.zeros((batch_size, max_object_nb), requires_grad=if_train).long()
    object_mask = torch.zeros((batch_size, max_object_nb), requires_grad=if_train).byte()
    object_lengths = torch.LongTensor(list(map(len, objects)))
    mask = torch.zeros((batch_size, max_seq_len), requires_grad=if_train).byte()  # defined in the full function; omitted in the excerpt

    for idx, (seq, label, seqlen) in enumerate(zip(words, labels, word_seq_lengths)):
        seqlen = seqlen.item()
        word_seq_tensor[idx, :seqlen] = torch.LongTensor(seq)
        label_seq_tensor[idx, :seqlen] = torch.LongTensor(label)
        mask[idx, :seqlen] = torch.Tensor([1]*seqlen)
Output of batch_word
tensor([[17263, 876, 17264, 5671, 8, 17265, 15, 313, 3082, 2398,
29, 6190, 9973, 11973, 127, 300, 4878, 36, 1070, 1397,
17266, 192, 192, 17267],
[ 2, 16550, 4, 147, 5, 723, 16551, 803, 1452, 83,
785, 69, 1053, 5, 16552, 16553, 0, 0, 0, 0,
0, 0, 0, 0],
[ 2, 2067, 5, 2068, 2069, 2070, 2071, 2072, 2070, 2073,
2074, 2075, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[ 2, 18490, 5, 3053, 19, 1746, 69, 1465, 8699, 192,
18491, 18492, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[ 2, 11806, 5, 6629, 300, 470, 14, 11807, 11808, 11809,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[ 2, 18415, 5, 10658, 18416, 3119, 5198, 45, 94, 18417,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[ 2, 10927, 5, 10782, 4134, 16619, 192, 16620, 16621, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[ 2, 16279, 5, 58, 786, 7513, 16280, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[ 2, 17990, 5, 17991, 17992, 17993, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[ 4012, 131, 96, 192, 20515, 20516, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0]], device='cuda:0')
torch.Size([10, 24])

tensor([[ 2, 7381, 5, 7388, 341, 16686, 16687, 3635, 37, 4535,
495, 16688, 15, 672, 5484, 8, 13808, 69, 37, 7460,
8, 92, 15, 92, 1403, 323, 7152, 1408],
[ 2, 17320, 5, 6209, 159, 1497, 14, 94, 181, 9653,
192, 17321, 2714, 5, 17322, 17323, 100, 5, 101, 15,
17324, 17325, 0, 0, 0, 0, 0, 0],
[15582, 69, 13403, 5, 1448, 3767, 225, 159, 15583, 15584,
88, 32, 9694, 15585, 39, 12421, 5, 15586, 6279, 15587,
0, 0, 0, 0, 0, 0, 0, 0],
[ 2, 20728, 5, 122, 1284, 8, 566, 629, 15, 15,
15, 15, 20, 20729, 15, 20730, 20731, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 2, 8460, 5, 8461, 8462, 97, 3660, 8463, 36, 622,
8464, 8465, 291, 8466, 8467, 8468, 8469, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 841, 5, 15006, 1399, 15007, 69, 15008, 659, 267, 15009,
4784, 15010, 15011, 15012, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 2, 21204, 5, 17259, 29, 341, 15, 21205, 159, 1619,
1760, 473, 15, 21206, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 2, 3129, 5, 1448, 13414, 13415, 19, 13416, 37, 1659,
8, 13417, 13418, 13419, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 1520, 10610, 36, 10611, 10612, 10613, 97, 10614, 15, 15,
15, 10615, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0],
[ 1636, 7560, 8074, 36, 6149, 8075, 192, 8076, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]],
device='cuda:0')
torch.Size([10, 28])

Seeing that 'cuda:0' makes me really envious (no bugs on their side). Looking at the source, it is handled simply by adding a gpu check and calling .cuda():

if gpu:
    word_seq_tensor = word_seq_tensor.cuda()
    for idx in range(feature_num):
        feature_seq_tensors[idx] = feature_seq_tensors[idx].cuda()
    word_seq_lengths = word_seq_lengths.cuda()
    word_seq_recover = word_seq_recover.cuda()
    label_seq_tensor = label_seq_tensor.cuda()
    char_seq_tensor = char_seq_tensor.cuda()
    char_seq_recover = char_seq_recover.cuda()
    object_tensor = object_tensor.cuda()
    object_mask = object_mask.cuda()
    mask = mask.cuda()

After that, these batch inputs are fed to the model and training can proceed.

One more note on how to avoid the CUDA device errors: tensors are placed on the CPU by default, so the model has to be moved explicitly as well:

if data.HP_gpu:
    model = model.cuda()

The structure of train itself is fairly simple: it loops for the configured number of epochs, where one epoch means one pass over the whole training set. Within an epoch it iterates over the batches, i.e. roughly train_size // batch_size groups, computing the loss and backpropagating to update the parameters.

model.train()
model.zero_grad()
batch_size = data.HP_batch_size
batch_id = 0
train_num = len(data.train_Ids)
total_batch = train_num//batch_size+1
for batch_id in range(total_batch):
    start = batch_id*batch_size
    end = (batch_id+1)*batch_size
    if end > train_num:
        end = train_num
    instance = data.train_Ids[start:end]
    if not instance:
        continue
    batch_word, batch_features, batch_wordlen, batch_wordrecover, batch_char, batch_charlen, batch_charrecover, batch_label, mask, batch_obj, obj_mask = batchify_with_label(instance, data.HP_gpu, True, data.sentence_classification, data.MAX_OBJECT_NB)
    instance_count += 1
    loss, tag_seq = model.calculate_loss(batch_word, batch_features, batch_wordlen, batch_char, batch_charlen, batch_charrecover, batch_label, mask, batch_obj, obj_mask)
    right, whole = predict_check(tag_seq, batch_label, mask, data.sentence_classification)
    right_token += right
    whole_token += whole
    # print("loss:",loss.item())
    sample_loss += loss.item()
    total_loss += loss.item()
    if end%500 == 0:
        temp_time = time.time()
        temp_cost = temp_time - temp_start
        temp_start = temp_time
        print("     Instance: %s; Time: %.2fs; loss: %.4f; acc: %s/%s=%.4f"%(end, temp_cost, sample_loss, right_token, whole_token, (right_token+0.)/whole_token))
        if sample_loss > 1e8 or str(sample_loss) == "nan":
            print("ERROR: LOSS EXPLOSION (>1e8) ! PLEASE SET PROPER PARAMETERS AND STRUCTURE! EXIT....")
            exit(1)
        sys.stdout.flush()
        sample_loss = 0
    loss.backward()
    optimizer.step()
    model.zero_grad()

There is also an evaluate part inside train, which I will discuss separately because I ran into a bug there earlier.

With that, the overall flow has roughly been walked through; the following chapters go into the details.

Overall structure

How to set up a project

This part is adapted from link.

The rough directory layout is as follows (not exhaustive):

├── checkpoints/
├── data/
│   ├── __init__.py
│   ├── dataset.py
│   └── get_data.sh
├── models/
│   ├── __init__.py
│   ├── crf.py
│   ├── mca.py
│   └── MUL_LSTM_MCA.py
├── utils/
│   ├── __init__.py
│   ├── alphabet.py
│   ├── data.py
│   ├── metric.py
│   └── functions.py
├── object_detector/
│   ├── mrcnn/
│   │   ├── config.py
│   │   ├── model.py
│   │   ├── utils.py
│   │   └── visualize.py
│   └── detector.py
├── sample.train.config
├── main.py
├── requirements.txt
└── README.md

Where:

  • checkpoints/: stores trained models, so that the program can reload a model and resume training after an abnormal exit
  • data/: data-related operations, including preprocessing and the dataset implementation
  • models/: model definitions; there can be several models, e.g. crf and MUL_LSTM_MCA above, one model per file
  • utils/: utility functions; in this project these mainly wrap the alphabet (vocabulary) helpers
  • config.py: configuration file collecting all configurable variables with their defaults (here that role is played by sample.train.config)
  • main.py: the main entry point for training and testing; different operations and parameters are selected via the command line
  • requirements.txt: third-party dependencies
  • README.md: the necessary documentation

Data loading

This depends on the specific task and on the format of the provided dataset; sometimes the splits are given, sometimes you have to split off the validation and test sets yourself. Tools like torchtext can also be used here; I may go into that some other time, but probably not in this post.

As an example, for text classification the processing is: tokenize, build the vocabulary, then build the embeddings. For NER you also have to handle the labels, so the details differ a bit, but there is a lot in common. I'll put off a proper summary until I've done more projects.

(figure: preprocessing pipeline for text classification)
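For intuition, here is a minimal toy sketch of that tokenize -> vocabulary -> id pipeline (my own example, not the repo's code):

# Toy illustration of tokenize -> vocab -> ids; the sentences and names here are made up.
from collections import Counter

corpus = ["George Zimmerman got shot at", "Swan upping : first stop"]
tokens = [sentence.split() for sentence in corpus]            # tokenize
counter = Counter(tok for sent in tokens for tok in sent)
vocab = {"<pad>": 0, "<unk>": 1}
for word, _ in counter.most_common():
    vocab[word] = len(vocab)                                  # build the vocabulary
ids = [[vocab.get(tok, vocab["<unk>"]) for tok in sent] for sent in tokens]
print(ids)   # these id sequences are what an embedding layer consumes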

About __init__.py

As you can see, almost every folder contains an __init__.py. A directory that contains an __init__.py becomes a package. __init__.py can be empty, or it can define attributes and methods of the package, but it must exist for other programs to import modules or functions from that directory. For example, with an __init__.py under data/, main.py can do from data.dataset import DogCat. And if __init__.py itself contains from .dataset import DogCat, then main.py can simply write from data import DogCat, or import data; dataset = data.DogCat, which is more convenient than from data.dataset import DogCat.
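A minimal sketch of the two layouts (DogCat is just the placeholder class name from the original tutorial):

# data/__init__.py
from .dataset import DogCat   # re-export so callers don't need to know the module layout

# main.py
from data import DogCat       # works because of the re-export above
# equivalent, without the re-export:
# from data.dataset import DogCat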

Model

Building the model itself feels relatively simple to me; overall it is like stacking blocks: get the input and output dimensions right, call a few library functions, and it's done.

But pretrained models like BERT are something I had never touched before; at first I only knew of Hugging Face and had no idea how to actually use BERT, so let me record it in detail here.

Simply importing the bert package is not enough; there is also a config and a weight-loading step:

import torch
import transformers
from transformers import BertConfig, BertModel

bert_config = BertConfig.from_json_file("model/config.json")
bert = BertModel(bert_config)

a = torch.load("./model/pytorch_model.bin")           # the pretrained weights
model_dict = bert.state_dict()
pretrained_dict = {k: v for k, v in a.items() if k in model_dict}
model_dict.update(pretrained_dict)
bert.load_state_dict(model_dict)

output = bert(tensor_test).last_hidden_state  # get the output (tensor_test: a LongTensor of token ids)

The config.json and pytorch_model.bin above need to be downloaded from Hugging Face; the link is here: bert-base-cased.
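As a side note, the more common way to get the same thing is from_pretrained, which loads the config and weights (from the hub or a local directory) in one call; a minimal sketch:

from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")   # or a local dir containing config.json / pytorch_model.bin

inputs = tokenizer("George Zimmerman got shot at .", return_tensors="pt")
output = bert(**inputs).last_hidden_state             # (1, seq_len, 768)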

As for the BiLSTM, there is not much to say: just library calls. I originally wasn't going to cover it, but I got stuck on a bug while changing the model, and this is exactly the part that was changed (left unchanged, it has no problems), so it is still worth a look.

# char -> emb -> drop
char_embeds = self.char_drop(self.char_embeddings(char_inputs))  # (batch_size * sent_len, word_length, char_emb_dim)
char_hidden = None
# -> char lstm
pack_char_input = pack_padded_sequence(char_embeds, char_seq_lengths.cpu().numpy(), batch_first=True)
char_rnn_out, char_hidden = self.char_lstm(pack_char_input, char_hidden)
# last hiddens
## char_hidden = (h_t, c_t)
# char_hidden[0] = h_t = (2, batch_size, lstm_dimension)
char_features = char_hidden[0].transpose(1, 0).contiguous().view(char_batch_size, -1)  # (batch_size * sent_len, char_hidden_dim)
char_features = char_features[char_seq_recover]
# cat char_hidden_dim for every char in a word
char_features = char_features.view(batch_size, sent_len, -1)  # (batch_size, sent_len, char_hidden_dim)

# word -> word emb
word_embs = self.word_embeddings(word_inputs)
# concat -> word represent
word_represent = torch.cat([word_embs, char_features], 2)
word_represent = self.word_drop(word_represent)  # (batch_size, sent_len, char_hidden_dim + word_emb_dim)

# -> word seq lstm
packed_word = pack_padded_sequence(word_represent, word_seq_lengths.cpu().numpy(), True)
hidden = None
lstm_out, hidden = self.word_lstm(packed_word, hidden)
lstm_out, _ = pad_packed_sequence(lstm_out, batch_first=True)
text_feat = self.droplstm(lstm_out)  # (batch_size, sent_len, hidden_dim)
text_feat:

tensor([[[ 2.4799e-01, 0.0000e+00, -5.9055e-02, ..., 0.0000e+00,
-0.0000e+00, -0.0000e+00],
[-6.9201e-01, 1.2951e+00, -4.5080e-01, ..., 0.0000e+00,
-0.0000e+00, -0.0000e+00],
[-0.0000e+00, 0.0000e+00, -2.0977e-01, ..., -2.4604e-01,
-4.0794e-01, 0.0000e+00],
...,
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00]]], device='cuda:0',
grad_fn=<NativeDropoutBackward0>)
torch.Size([10, 22, 200])
object_feat1:

tensor([[[ 0.0969, 0.1156, 0.7336, ..., 0.3659, 0.8398, -0.0733],
[ 0.0969, 0.1156, 0.7336, ..., 0.3659, 0.8398, -0.0733],
[ 0.0969, 0.1156, 0.7336, ..., 0.3659, 0.8398, -0.0733],
[ 0.0969, 0.1156, 0.7336, ..., 0.3659, 0.8398, -0.0733]]

...,

[[ 0.3127, 0.2678, -0.1752, ..., 0.0502, 0.4489, 0.0190],
[ 0.0328, 0.0349, 0.0448, ..., 0.0234, -0.0258, -0.0344],
[ 0.0328, 0.0349, 0.0448, ..., 0.0234, -0.0258, -0.0344],
[ 0.0328, 0.0349, 0.0448, ..., 0.0234, -0.0258, -0.0344]]],
device='cuda:0', grad_fn=<AddBackward0>)
torch.Size([10, 4, 200])
object_feat2:

tensor([[[ 0.6313, 0.0668, 1.7060, ..., 0.7507, -1.8600, -0.0626],
[-0.1332, -0.1341, 1.6088, ..., -0.1397, -0.1451, -0.1421],
[ 0.2382, -0.0867, 1.9205, ..., -0.0903, -1.6302, -1.1887],

...,
[ 0.3853, -0.0110, -0.0118, ..., 1.0296, -1.6793, -0.0116],
[ 0.0198, -0.0935, -0.1009, ..., -0.0973, -0.1011, -0.0990],
[-0.0844, 0.0894, 1.4972, ..., -0.0885, -0.0920, -0.0901]]],
device='cuda:0', grad_fn=<AddBackward0>)
torch.Size([10, 22, 200])
final_feature:

tensor([[[ 1.8418, -0.1712, 0.9581, ..., 0.7861, -1.2411, 0.4923],
[-2.2322, 2.3943, 0.4914, ..., -1.0388, 0.6236, 0.4107],
[ 0.3094, -0.5312, 1.0190, ..., -1.4420, -1.4349, -0.6635],
...,

[ 0.6611, -0.3536, -0.6253, ..., 1.3579, -1.0447, 0.5446],
[-0.2124, -0.5470, -0.7103, ..., -0.9521, 0.6715, 0.4549],
[-0.4616, -0.1183, 0.8152, ..., -0.9340, 0.6814, 0.4641]]],
device='cuda:0', grad_fn=<CloneBackward0>)
torch.Size([10, 22, 200])
outputs:

torch.Size([10, 22, 12])
tensor([[[-8.9976e-01, 3.3332e+00, 4.5517e-01, ..., -6.5960e-02,
5.3608e-01, 3.6342e-01],
[ 8.6288e-01, 6.8532e+00, -1.0493e+00, ..., 4.0046e-01,
5.7598e-01, 3.4561e-01],
[-3.6927e-01, 4.3114e+00, -1.8063e+00, ..., -1.3175e+00,
-2.2204e-01, 2.9056e-01],

...,
[ 8.6925e-01, 3.2675e+00, 5.0813e-01, ..., -9.7195e-01,
1.8619e-03, -4.3975e-01],
[ 2.0090e-01, 3.2550e+00, -3.0622e-01, ..., -1.0537e+00,
-5.0671e-01, -1.6619e-01],
[ 3.9266e-02, 1.8018e+00, -6.3803e-01, ..., -2.3062e-01,
3.9648e-01, -5.0846e-01]]], device='cuda:0', grad_fn=<AddBackward0>)

Let me also talk about the CRF. Previously I only knew about the Viterbi algorithm and had basically no idea how a CRF is combined with a neural network, so here is a reasonably detailed reading based on the official PyTorch tutorial and this model's code.

For the theory, see: https://www.zhihu.com/question/316740909/answer/2380526295

Supplementary overviews of probabilistic graphical models:

https://www.cnblogs.com/jiangkejie/p/10729773.html

https://blog.csdn.net/weixin_44441131/article/details/104434297

https://longaspire.github.io/blog/概率图模型总览/

For an analysis of the official PyTorch BiLSTM + CRF code, see:

The CRF in this model is analyzed mainly as follows:

Here is a note on the customary way models are wired up:

In models/__init__.py the code is:

from model.MUL_LSTM_MCA import *

so that in the main program you can write any of:

from models import MUL_LSTM_MCA
----------------------------
import models
model = models.MUL_LSTM_MCA()
----------------------------
import models
model = getattr(models, 'MUL_LSTM_MCA')()

The last form matters; you may not have seen it before, but it is actually the most commonly used.
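Why it matters: the model class can then be chosen by a plain string coming from the config, without an if/else chain. A minimal sketch (the config value here is made up for illustration):

import models

# e.g. model_name comes from a config file or a command-line flag
model_name = 'MUL_LSTM_MCA'                # hypothetical config value
ModelClass = getattr(models, model_name)   # look the class up by name on the package
model = ModelClass()                       # same as models.MUL_LSTM_MCA()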

Data processing

Building the Alphabet

First, a sample of the dataset format:

IMGID:1015799
RT O
@JayKenMinaj O
_ O
: O
Me O
outside O
of O
where O
George B-PER
Zimmerman I-PER
got O
shot O
at O
. O
You O
know O
God O
is O
so O
good O
. O
http://t.co/Z3neVBQ7vF O

IMGID:1109405
Swan O
upping O
: O
first O
stop O
Hermitage B-LOC
Warf I-LOC
, O
Tower B-LOC
Bridge I-LOC
( O
Tower B-LOC
and O
Olympic B-OTHER
rings O
in O
the O
background O
) O
http://t.co/pQdOHx3s O

IMGID:563049
RT O
@redbullESPORTS O
: O
Smash B-OTHER
Shiba I-OTHER
is O
stoked O
for O
#NWM7 O
Melee B-OTHER
Grand I-OTHER
Finals I-OTHER
. O
http://t.co/hmTEF9q0gz O

Then the alphabets are built. Two formats are handled, but in practice the data is in the CoNLL 2003 style, so I'll skip the sentence_classification format. It is easy to follow: first check if len(line.strip().split('\t')) >= 2, i.e. pick out the word/label lines, then split out the two parts and add them to word_alphabet and label_alphabet respectively. Note that feature here refers to markers like the POS tags in the sentence_classification format; for convenience it can be ignored here.

The Alphabet class

"""
Alphabet maps objects to integer ids. It provides two way mapping from the index to the objects.
"""
class Alphabet:
def __init__(self, name, label=False, keep_growing=True):
self.name = name
self.UNKNOWN = "</unk>"
self.label = label
self.instance2index = {}
self.instances = []
self.keep_growing = keep_growing

# Index 0 is occupied by default, all else following.
self.default_index = 0
self.next_index = 1
if not self.label:
self.add(self.UNKNOWN)

def add(self, instance):
if instance not in self.instance2index:
self.instances.append(instance)
self.instance2index[instance] = self.next_index
self.next_index += 1

def get_index(self, instance):
try:
return self.instance2index[instance]
except KeyError:
if self.keep_growing:
index = self.next_index
self.add(instance)
return index
else:
return self.instance2index[self.UNKNOWN]

def get_instance(self, index):
if index == 0:
if self.label:
return self.instances[0]
# First index is occupied by the wildcard element.
return None
try:
return self.instances[index - 1]
except IndexError:
print('WARNING:Alphabet get_instance ,unknown instance, return the first label.')
return self.instances[0]

The build_alphabet function in Data:

def build_alphabet(self, input_file):
    in_lines = open(input_file, 'r', encoding='utf-8').readlines()
    for line in in_lines:
        if len(line.strip().split('\t')) >= 2:
            ## if sentence classification data format, splited by \t
            if self.sentence_classification:
                pairs = line.strip().split(self.split_token)
                sent = pairs[0]
                if sys.version_info[0] < 3:
                    sent = sent.decode('utf-8')
                words = sent.split()
                for word in words:
                    if self.number_normalized:
                        word = normalize_word(word)
                    self.word_alphabet.add(word)
                    for char in word:
                        self.char_alphabet.add(char)
                label = pairs[-1]
                self.label_alphabet.add(label)
                ## build feature alphabet
                for idx in range(self.feature_num):
                    feat_idx = pairs[idx+1].split(']',1)[-1]
                    self.feature_alphabets[idx].add(feat_idx)

            ## if sequence labeling data format i.e. CoNLL 2003
            else:
                pairs = line.strip().split()
                word = pairs[0]
                if sys.version_info[0] < 3:
                    word = word.decode('utf-8')
                if self.number_normalized:
                    word = normalize_word(word)
                label = pairs[-1]
                self.label_alphabet.add(label)
                self.word_alphabet.add(word)
                ## build feature alphabet
                for idx in range(self.feature_num):
                    feat_idx = pairs[idx+1].split(']',1)[-1]
                    self.feature_alphabets[idx].add(feat_idx)
                for char in word:
                    self.char_alphabet.add(char)
    self.word_alphabet_size = self.word_alphabet.size()
    self.char_alphabet_size = self.char_alphabet.size()
    self.label_alphabet_size = self.label_alphabet.size()
    for idx in range(self.feature_num):
        self.feature_alphabet_sizes[idx] = self.feature_alphabets[idx].size()
    startS = False
    startB = False
    for label, _ in self.label_alphabet.iteritems():
        if "S-" in label.upper():
            startS = True
        elif "B-" in label.upper():
            startB = True
    if startB:
        if startS:
            self.tagScheme = "BMES"
        else:
            self.tagScheme = "BIO"
    if self.sentence_classification:
        self.tagScheme = "Not sequence labeling task"

Building the embeddings

The first thing to note is that the embedding and the alphabet indices must correspond. That is, if you choose GloVe embeddings, you have to build a dictionary with the corresponding indices; the same holds for BERT. Then the vocabulary built from the training set and this embedding dictionary (not needed for BERT, which has its own embedding layer) are used to construct the input word vectors.

The original model uses 200-dimensional GloVe vectors, namely glove.twitter.27B.200d.txt, whose contents look like this:

(figure: sample lines of glove.twitter.27B.200d.txt, i.e. a word followed by its 200-dimensional vector)

Put plainly, a dictionary is built whose key is the leading word on each line and whose value is the corresponding vector. The construction is as follows:

def load_pretrain_emb(embedding_path):
    embedd_dim = -1
    embedd_dict = dict()
    with open(embedding_path, 'r', encoding="utf8") as file:
        for line in file:
            line = line.strip()
            if len(line) == 0:
                continue
            tokens = line.split()
            if embedd_dim < 0:
                embedd_dim = len(tokens) - 1
            elif embedd_dim + 1 != len(tokens):
                ## ignore illegal embedding line
                continue
            # assert (embedd_dim + 1 == len(tokens))
            embedd = np.empty([1, embedd_dim])
            embedd[:] = tokens[1:]
            if sys.version_info[0] < 3:
                first_col = tokens[0].decode('utf-8')
            else:
                first_col = tokens[0]
            embedd_dict[first_col] = embedd
    return embedd_dict, embedd_dim

Then the vocabulary we built ourselves is used to create the input embedding matrix, which is a rather neat trick: our alphabet gives index -> word, the embedding dictionary gives word -> embedding, so composing them gives index -> embedding directly. This works because GloVe is literally a set of word/embedding pairs. With BERT, by contrast, you must use its own vocabulary indices, otherwise they will not correspond to the pretrained embeddings.

def build_pretrain_embedding(embedding_path, word_alphabet, embedd_dim=100, norm=True):
    embedd_dict = dict()
    if embedding_path != None:
        embedd_dict, embedd_dim = load_pretrain_emb(embedding_path)
    alphabet_size = word_alphabet.size()
    scale = np.sqrt(3.0 / embedd_dim)
    pretrain_emb = np.zeros([word_alphabet.size(), embedd_dim])
    perfect_match = 0
    case_match = 0
    not_match = 0
    for word, index in word_alphabet.iteritems():
        if word in embedd_dict:
            if norm:
                pretrain_emb[index,:] = norm2one(embedd_dict[word])
            else:
                pretrain_emb[index,:] = embedd_dict[word]
            perfect_match += 1
        elif word.lower() in embedd_dict:
            if norm:
                pretrain_emb[index,:] = norm2one(embedd_dict[word.lower()])
            else:
                pretrain_emb[index,:] = embedd_dict[word.lower()]
            case_match += 1
        else:
            pretrain_emb[index,:] = np.random.uniform(-scale, scale, [1, embedd_dim])
            not_match += 1
    pretrained_size = len(embedd_dict)
    print("Embedding:\n     pretrain word:%s, prefect match:%s, case_match:%s, oov:%s, oov%%:%s"%(pretrained_size, perfect_match, case_match, not_match, (not_match+0.)/alphabet_size))
    return pretrain_emb, embedd_dim

Model training & evaluation

Modifying the model

Vocabulary handling

The first step is rewriting the vocabulary handling, because a different embedding is used. The BERT pretrained model already comes with a given vocabulary (vocab.txt), so there is no need to build one from our own corpus. In fact, we must not build one from our own corpus, because the index of the same token in a self-built vocabulary and in vocab.txt would certainly differ, and it is these indices that are fed into BERT as input.

Concretely, build_alphabet is rewritten; the modified version is:

    def build_label_alphabet(self, input_file):
        in_lines = open(input_file, 'r', encoding='utf-8').readlines()

        for line in in_lines:
            if len(line.strip().split('\t')) >= 2:
                ## if sentence classification data format, splited by \t
                if self.sentence_classification:
                    pairs = line.strip().split(self.split_token)
                    sent = pairs[0]
                    if sys.version_info[0] < 3:
                        sent = sent.decode('utf-8')

                    label = pairs[-1]
                    self.label_alphabet.add(label)
                    ## build feature alphabet
                    for idx in range(self.feature_num):
                        feat_idx = pairs[idx+1].split(']',1)[-1]
                        self.feature_alphabets[idx].add(feat_idx)

                ## if sequence labeling data format i.e. CoNLL 2003
                else:
                    pairs = line.strip().split()
                    label = pairs[-1]
                    self.label_alphabet.add(label)
                    ## build feature alphabet
                    for idx in range(self.feature_num):
                        feat_idx = pairs[idx+1].split(']',1)[-1]
                        self.feature_alphabets[idx].add(feat_idx)

        self.label_alphabet_size = self.label_alphabet.size()
        for idx in range(self.feature_num):
            self.feature_alphabet_sizes[idx] = self.feature_alphabets[idx].size()

        startS = False
        startB = False
        for label, _ in self.label_alphabet.iteritems():
            if "S-" in label.upper():
                startS = True
            elif "B-" in label.upper():
                startB = True
        if startB:
            if startS:
                self.tagScheme = "BMES"
            else:
                self.tagScheme = "BIO"
        if self.sentence_classification:
            self.tagScheme = "Not sequence labeling task"

    # the modified part is here
    def build_word_alphabet(self, input_file):
        with open(input_file, 'r', encoding='utf-8') as f:
            for i, word in enumerate(f):
                w = word.strip('\n')
                self.word_alphabet.add(w)
        self.word_alphabet_size = self.word_alphabet.size()
Essentially a build_word_alphabet was added to process BERT's vocab.txt, reusing the hand-written Alphabet class, which is quite convenient. Then all the char-related parts were removed, along with the original word-embedding construction.

However, combined with the model replacement below, the actual results were not great. I suspect tokenization is the cause: BERT's tokenizer splits words further into subwords, so feeding whole-word indices directly is probably somewhat problematic. The next step is to tokenize with the Hugging Face tokenizer:
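A minimal sketch of what that would look like (not yet what my code does): tokenize the pre-split words with is_split_into_words=True, and use word_ids() to map the subword pieces back to the original tokens so that the labels stay aligned:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

words = ['Me', 'outside', 'of', 'where', 'George', 'Zimmerman', 'got', 'shot', 'at', '.']
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))  # [CLS], subword pieces, [SEP]
print(enc.word_ids())  # each subword piece's source word index (None for the special tokens)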

Replacing the model

This splits roughly into two parts: first, drop the object_feature and simply replace the char-BiLSTM setup with BERT; second, on top of that, add the visual information back in to do multimodal NER.

Also, since I had never used BERT before and got stuck on quite a few bugs the first time, I decided to split the model modification into stages to reduce coupled sources of error.

Replacing with BERT

The model is changed to the following:

import torch
import torch.nn as nn
import transformers
from transformers import BertConfig, BertModel
from model.crf import CRF   # the repo's CRF layer (model/crf.py)


class BERT_CRF(nn.Module):
    """
    bert + crf
    """
    def __init__(self, data):
        super(BERT_CRF, self).__init__()
        print('build BERT_CRF network...')
        self.gpu = data.HP_gpu
        self.average_batch = data.average_batch_loss

        # might also add: self.dropbert = nn.Dropout(data.HP_dropout)  # word seq bert out -> dropout

        # project the 768-dim BERT output down to the text hidden dim
        self.txt_feat_linear = nn.Linear(768, 200)

        # 200 = output dim of txt_feat_linear (plays the role of mca_params['HIDDEN_SIZE'] in the original MCA model)
        self.hidden2tag = nn.Linear(200, data.label_alphabet_size + 2)
        # crf
        self.crf = CRF(data.label_alphabet_size, self.gpu)

        # Bert
        self.bert_config = BertConfig.from_json_file("model/config.json")
        self.bert = BertModel(self.bert_config)

        a = torch.load("./model/pytorch_model.bin")
        model_dict = self.bert.state_dict()
        pretrained_dict = {k: v for k, v in a.items() if k in model_dict}
        model_dict.update(pretrained_dict)
        self.bert.load_state_dict(model_dict)

    def _get_bert_features(self, word_inputs, feature_inputs, word_seq_lengths):
        """
        word_input -> bert -> dropout -> Linear -> tagscores
        :param word_inputs: (batch_size, sent_len)
        :param feature_inputs: [(batch_size, sent_len), ...] list of variables
        :param word_seq_lengths: list of batch_size, (batch_size, 1)

        :return:
            variable(batch_size, sent_len, hidden_dim)
        """
        batch_size = word_inputs.size(0)
        sent_len = word_inputs.size(1)

        # Bert
        device = torch.device('cuda:0')

        tmp = word_inputs.to(device).long()   # move the ids to the GPU before feeding BERT

        text_feat = self.txt_feat_linear(self.bert(tmp).last_hidden_state)

        # -> tagscore
        outputs = self.hidden2tag(text_feat)

        return outputs

Alternatively, use distilbert-base-cased, i.e. the distilled version; it really does work well and runs fast. The changes are:

import transformers
from transformers import DistilBertConfig,DistilBertModel

self.bert_config = DistilBertConfig.from_json_file("model/config.json")
self.bert = DistilBertModel(self.bert_config)

https://huggingface.co/distilbert-base-cased

Adding the multimodal information

There are still a few points to consider here, for example how to choose the embedding for the object_feature; whether it can simply be chosen the same way as the BERT embedding needs some study. The rough plan is to concat first and then compute the fused feature.
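A minimal sketch of that plan, under my own assumptions (the object labels go through a plain nn.Embedding, get mean-pooled and broadcast over the sentence, are concatenated with the BERT text features, and projected back down; this is not the OCSGA fusion module):

import torch
import torch.nn as nn

class SimpleConcatFusion(nn.Module):
    """Toy fusion: mean-pool object embeddings, tile over the sentence, concat with text features."""
    def __init__(self, object_alphabet_size, object_emb_dim=200, text_dim=200):
        super().__init__()
        self.object_embeddings = nn.Embedding(object_alphabet_size, object_emb_dim)
        self.fuse = nn.Linear(text_dim + object_emb_dim, text_dim)

    def forward(self, text_feat, object_inputs, object_mask):
        # text_feat: (batch, sent_len, text_dim); object_inputs/object_mask: (batch, max_object_nb)
        obj_emb = self.object_embeddings(object_inputs)                # (batch, max_object_nb, obj_dim)
        obj_emb = obj_emb * object_mask.unsqueeze(-1).float()          # zero out the padded object slots
        obj_pooled = obj_emb.sum(1) / object_mask.float().sum(1, keepdim=True).clamp(min=1)
        obj_tiled = obj_pooled.unsqueeze(1).expand(-1, text_feat.size(1), -1)
        return self.fuse(torch.cat([text_feat, obj_tiled], dim=-1))    # (batch, sent_len, text_dim)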

Bugs encountered

Opening files in Colab

Fix: when !cd doesn't work, try %cd.
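For the record, a cell that works for me looks like the following (the Drive path is my own layout, and the config flag matches the argparse setup shown earlier):

# In a Colab cell: "!cd xxx" only changes directory inside a throwaway subshell,
# while the "%cd" magic changes the notebook's working directory persistently.
%cd /content/drive/MyDrive/Pytorch-implementation-for-OCSGA
!python main.py --config sample.train.config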

Putting Torch tensors on the GPU

This one is a bit odd: a few ways of writing it don't work, so it needs some care.
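What I settled on as the safe pattern (a generic sketch, not tied to this repo): move the model once, move every input tensor, and create any new tensors directly on the same device:

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(50, 12).to(device)          # move the parameters once
batch_word = torch.randn(10, 50).to(device)   # move every input tensor used in forward()

out = model(batch_word)                       # both live on the same device, so no mismatch error

# a frequent source of "expected ... cuda:0 but got ... cpu" errors is creating a brand-new
# tensor inside forward() without a device; create it on the right device directly:
tmp = torch.zeros(10, 50, device=device)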

NaN gradients

This was probably the learning rate not being tuned well; there are plenty of references on this. For BERT, setting it to 1e-5 or 3e-5 is fine, i.e. something like

optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

Problems with the p, r, f values

Abnormal values can show up here. These are the evaluation metrics (precision, recall, F1); if you are not familiar with them, see: https://zhuanlan.zhihu.com/p/161703182

The abnormal values mean the model has learned essentially nothing, which may be related to the learning-rate setting; in my case an incorrect learning rate was exactly what caused this.
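For reference, a tiny sketch of how entity-level precision/recall/F1 are computed from predicted and gold entity spans (my own toy version, not the repo's metric.py):

def prf(pred_spans, gold_spans):
    """pred_spans / gold_spans: sets of (sentence_id, start, end, type) tuples."""
    tp = len(pred_spans & gold_spans)
    p = tp / len(pred_spans) if pred_spans else 0.0
    r = tp / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

# if the model learned nothing, pred_spans is empty (or garbage), so p, r, f all collapse to 0
print(prf({(0, 8, 9, 'PER')}, {(0, 8, 9, 'PER'), (1, 5, 6, 'LOC')}))  # (1.0, 0.5, 0.666...)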

BERT overfitting

See the example below: the F1 score on the training set is quite high, but on the dev and test sets it is not.

Epoch: 9 training finished. Time: 115.85s, speed: 34.53st/s,  total loss: 11371.80078125
totalloss: 11371.80078125
Entity type: PER p = 0.9280045351473923 r = 0.7383852052322959 f = 0.8224064305450891
Entity type: LOC p = 0.8448979591836735 r = 0.7919655667144907 f = 0.8175759071834116
Entity type: ORG p = 0.8514056224899599 r = 0.45689655172413796 f = 0.5946704067321178
Entity type: OTHER p = 0.8706896551724138 r = 0.4297872340425532 f = 0.5754985754985755
Entity type: all p = 0.8794280836534357 r = 0.6672603626943006 f = 0.7587921193150433
speed: 39.21st/s; acc: 0.9588, p: 0.8794, r: 0.6673, f: 0.7588
Entity type: PER p = 0.5945273631840796 r = 0.4329710144927536 f = 0.5010482180293501
Entity type: LOC p = 0.5848623853211009 r = 0.4885057471264368 f = 0.5323590814196242
Entity type: ORG p = 0.3375 r = 0.10931174089068826 f = 0.16513761467889906
Entity type: OTHER p = 0.323943661971831 r = 0.10222222222222223 f = 0.1554054054054054
Entity type: all p = 0.5500505561172901 r = 0.351875808538163 f = 0.42919132149901384
Dev: time: 127.54s, speed: 39.27st/s; acc: 0.8892, p: 0.5501, r: 0.3519, f: 0.4292
Entity type: PER p = 0.5916149068322981 r = 0.4196035242290749 f = 0.49097938144329895
Entity type: LOC p = 0.5734870317002881 r = 0.46906305244549207 f = 0.5160453808752026
Entity type: ORG p = 0.32450331125827814 r = 0.11680572109654351 f = 0.17177914110429449
Entity type: OTHER p = 0.24390243902439024 r = 0.06887052341597796 f = 0.10741138560687431
Entity type: all p = 0.5359723531259818 r = 0.335959038991729 f = 0.4130250574990922
Test: time: 84.01s, speed: 38.79st/s; acc: 0.8881, p: 0.5360, r: 0.3360, f: 0.4130
Epoch: 10/200
Shuffle: first input word list: [1989, 121, 1, 133, 4110, 1105, 3966, 1117, 3589, 1105, 4707, 121, 1, 1, 1]
Instance: 500; Time: 14.51s; loss: 1278.0686; acc: 7798.0/8189.0=0.9523
Instance: 1000; Time: 14.53s; loss: 1258.4695; acc: 15454.0/16250.0=0.9510
Instance: 1500; Time: 14.51s; loss: 1191.4680; acc: 23218.0/24397.0=0.9517
Instance: 2000; Time: 14.56s; loss: 1284.1797; acc: 30973.0/32569.0=0.9510
Instance: 2500; Time: 14.64s; loss: 1328.2502; acc: 38397.0/40438.0=0.9495
Instance: 3000; Time: 14.43s; loss: 1240.5212; acc: 45835.0/48265.0=0.9497
Instance: 3500; Time: 14.55s; loss: 1379.7368; acc: 53359.0/56210.0=0.9493
Instance: 4000; Time: 14.42s; loss: 1359.3643; acc: 61151.0/64439.0=0.9490
Instance: 4000; Time: 0.06s; loss: 0.0000; acc: 61151.0/64439.0=0.9490
Epoch: 10 training finished. Time: 116.21s, speed: 34.42st/s, total loss: 10320.058349609375
totalloss: 10320.058349609375
Entity type: PER p = 0.9315818281335523 r = 0.7677041046459179 f = 0.8417408506429277
Entity type: LOC p = 0.8866090712742981 r = 0.7852702056432329 f = 0.8328683743342633
Entity type: ORG p = 0.8691099476439791 r = 0.5366379310344828 f = 0.6635576282478348
Entity type: OTHER p = 0.7667785234899329 r = 0.48617021276595745 f = 0.5950520833333334
Entity type: all p = 0.8867574257425742 r = 0.6960816062176166 f = 0.7799346879535558
speed: 38.91st/s; acc: 0.9632, p: 0.8868, r: 0.6961, f: 0.7799
Entity type: PER p = 0.6033653846153846 r = 0.45471014492753625 f = 0.5185950413223142
Entity type: LOC p = 0.6221198156682027 r = 0.5172413793103449 f = 0.5648535564853556
Entity type: ORG p = 0.3114754098360656 r = 0.15384615384615385 f = 0.20596205962059622
Entity type: OTHER p = 0.18055555555555555 r = 0.11555555555555555 f = 0.14092140921409213
Entity type: all p = 0.5241935483870968 r = 0.3783958602846054 f = 0.439519158527423
Dev: time: 128.61s, speed: 38.86st/s; acc: 0.8871, p: 0.5242, r: 0.3784, f: 0.4395
Entity type: PER p = 0.5936578171091446 r = 0.44328193832599116 f = 0.5075662042875158
Entity type: LOC p = 0.5782219159200551 r = 0.49440188568061283 f = 0.5330368487928844
Entity type: ORG p = 0.30434782608695654 r = 0.14183551847437426 f = 0.19349593495934958
Entity type: OTHER p = 0.16263736263736264 r = 0.10192837465564739 f = 0.12531752751905167
Entity type: all p = 0.5028743498494388 r = 0.3617565970854667 f = 0.42079945023479554
Test: time: 83.61s, speed: 38.97st/s; acc: 0.8852, p: 0.5029, r: 0.3618, f: 0.4208
Epoch: 11/200
Shuffle: first input word list: [1, 1, 133, 1, 21542, 123, 25351, 4077, 1109, 1, 3072, 2502, 1, 1]
Instance: 500; Time: 14.56s; loss: 1038.2153; acc: 7877.0/8206.0=0.9599
Instance: 1000; Time: 14.58s; loss: 1133.9968; acc: 15664.0/16356.0=0.9577
Instance: 1500; Time: 14.43s; loss: 1135.8523; acc: 23177.0/24248.0=0.9558
Instance: 2000; Time: 14.63s; loss: 1096.4941; acc: 30962.0/32400.0=0.9556
Instance: 2500; Time: 14.63s; loss: 1277.8718; acc: 38635.0/40496.0=0.9540
Instance: 3000; Time: 14.55s; loss: 1193.8625; acc: 46213.0/48460.0=0.9536
Instance: 3500; Time: 14.61s; loss: 1012.0950; acc: 53744.0/56312.0=0.9544
Instance: 4000; Time: 14.67s; loss: 1173.8591; acc: 61486.0/64439.0=0.9542
Instance: 4000; Time: 0.07s; loss: 0.0000; acc: 61486.0/64439.0=0.9542
Epoch: 11 training finished. Time: 116.72s, speed: 34.27st/s, total loss: 9062.2470703125
totalloss: 9062.2470703125
Entity type: PER p = 0.8663298116674323 r = 0.8506991429860171 f = 0.8584433318161129
Entity type: LOC p = 0.9219701162147206 r = 0.7967479674796748 f = 0.8547973319651103
Entity type: ORG p = 0.808695652173913 r = 0.6012931034482759 f = 0.6897404202719406
Entity type: OTHER p = 0.8234265734265734 r = 0.5010638297872341 f = 0.623015873015873
Entity type: all p = 0.8732367518109035 r = 0.7417422279792746 f = 0.8021362283312905
speed: 38.69st/s; acc: 0.9675, p: 0.8732, r: 0.7417, f: 0.8021
Entity type: PER p = 0.5479166666666667 r = 0.47644927536231885 f = 0.5096899224806202
Entity type: LOC p = 0.6430317848410758 r = 0.5038314176245211 f = 0.564983888292159
Entity type: ORG p = 0.25196850393700787 r = 0.12955465587044535 f = 0.1711229946524064
Entity type: OTHER p = 0.2079207920792079 r = 0.09333333333333334 f = 0.12883435582822086
Entity type: all p = 0.5183527305282005 r = 0.3745148771021992 f = 0.43484791588434096
Dev: time: 129.12s, speed: 38.97st/s; acc: 0.8893, p: 0.5184, r: 0.3745, f: 0.4348
Entity type: PER p = 0.5332125603864735 r = 0.48623348017621143 f = 0.5086405529953917
Entity type: LOC p = 0.6096822995461422 r = 0.4749558043606364 f = 0.5339516396157669
Entity type: ORG p = 0.2968036529680365 r = 0.15494636471990464 f = 0.20360219263899765
Entity type: OTHER p = 0.18575851393188855 r = 0.08264462809917356 f = 0.11439466158245949
Entity type: all p = 0.502540786306499 r = 0.37002756990941316 f = 0.4262220709992061
Test: time: 83.39s, speed: 39.08st/s; acc: 0.8876, p: 0.5025, r: 0.3700, f: 0.4262
Epoch: 12/200
Shuffle: first input word list: [7506, 1, 119, 1699, 1106, 1, 119, 1115, 1105, 7245, 1106, 4702, 1109, 1, 1107, 1295, 1, 1112, 5130, 1108, 1620, 4801, 2688, 1]
Instance: 500; Time: 14.44s; loss: 1003.1052; acc: 7896.0/8212.0=0.9615
Instance: 1000; Time: 14.64s; loss: 1009.0051; acc: 15646.0/16300.0=0.9599
Instance: 1500; Time: 14.56s; loss: 1015.7129; acc: 23344.0/24327.0=0.9596
Instance: 2000; Time: 14.51s; loss: 992.6101; acc: 31061.0/32366.0=0.9597
Instance: 2500; Time: 14.52s; loss: 952.4902; acc: 38903.0/40512.0=0.9603
Instance: 3000; Time: 14.68s; loss: 989.9021; acc: 46493.0/48444.0=0.9597
Instance: 3500; Time: 15.40s; loss: 1149.4375; acc: 54169.0/56501.0=0.9587
Instance: 4000; Time: 14.50s; loss: 1003.2131; acc: 61772.0/64439.0=0.9586
Instance: 4000; Time: 0.06s; loss: 0.0000; acc: 61772.0/64439.0=0.9586
Epoch: 12 training finished. Time: 117.31s, speed: 34.10st/s, total loss: 8115.476318359375
totalloss: 8115.476318359375
Entity type: PER p = 0.9242191500256016 r = 0.8141632837167343 f = 0.8657074340527579
Entity type: LOC p = 0.9044619422572179 r = 0.8240076518412243 f = 0.8623623623623624
Entity type: ORG p = 0.8483965014577259 r = 0.6271551724137931 f = 0.7211895910780669
Entity type: OTHER p = 0.7117117117117117 r = 0.6723404255319149 f = 0.6914660831509846
Entity type: all p = 0.8729749631811488 r = 0.7678108808290155 f = 0.8170227429359064
speed: 38.96st/s; acc: 0.9698, p: 0.8730, r: 0.7678, f: 0.8170
Entity type: PER p = 0.578088578088578 r = 0.4492753623188406 f = 0.5056065239551478
Entity type: LOC p = 0.5982532751091703 r = 0.524904214559387 f = 0.5591836734693878
Entity type: ORG p = 0.3088235294117647 r = 0.1700404858299595 f = 0.2193211488250653
Entity type: OTHER p = 0.16033755274261605 r = 0.1688888888888889 f = 0.1645021645021645
Entity type: all p = 0.4777777777777778 r = 0.38939197930142305 f = 0.42908054169636495
Dev: time: 128.44s, speed: 38.91st/s; acc: 0.8829, p: 0.4778, r: 0.3894, f: 0.4291
Entity type: PER p = 0.5748148148148148 r = 0.42731277533039647 f = 0.4902084649399874
Entity type: LOC p = 0.554983922829582 r = 0.5085444902769594 f = 0.5307503075030751
Entity type: ORG p = 0.30455635491606714 r = 0.15137067938021453 f = 0.2022292993630573
Entity type: OTHER p = 0.11796982167352538 r = 0.1184573002754821 f = 0.11821305841924398
Entity type: all p = 0.45717106887188347 r = 0.3647105159511619 f = 0.40573994961112936
Test: time: 83.82s, speed: 38.88st/s; acc: 0.8814, p: 0.4572, r: 0.3647, f: 0.4057
Epoch: 13/200
Shuffle: first input word list: [1989, 121, 2407, 9415, 1626, 3110, 1115, 1119, 1238, 1108, 1105, 5838, 5361, 7239, 119, 1, 121, 21908, 1120, 1891, 4675, 1, 1]
Instance: 500; Time: 14.47s; loss: 1008.4021; acc: 7600.0/7921.0=0.9595
Instance: 1000; Time: 14.60s; loss: 904.8628; acc: 15557.0/16189.0=0.9610
Instance: 1500; Time: 15.21s; loss: 890.7754; acc: 23248.0/24185.0=0.9613
Instance: 2000; Time: 14.53s; loss: 934.3428; acc: 30880.0/32140.0=0.9608
Instance: 2500; Time: 14.48s; loss: 902.6064; acc: 38665.0/40220.0=0.9613
Instance: 3000; Time: 14.53s; loss: 974.2864; acc: 46434.0/48328.0=0.9608
Instance: 3500; Time: 14.61s; loss: 993.2480; acc: 54312.0/56536.0=0.9607
Instance: 4000; Time: 14.60s; loss: 872.9658; acc: 61925.0/64439.0=0.9610
Instance: 4000; Time: 0.06s; loss: 0.0000; acc: 61925.0/64439.0=0.9610
Epoch: 13 training finished. Time: 117.09s, speed: 34.16st/s, total loss: 7481.48974609375
totalloss: 7481.48974609375
Entity type: PER p = 0.8835676625659051 r = 0.9070816418583671 f = 0.8951702648564435
Entity type: LOC p = 0.9378980891719745 r = 0.8450502152080345 f = 0.889056603773585
Entity type: ORG p = 0.9246153846153846 r = 0.6476293103448276 f = 0.761723700887199
Entity type: OTHER p = 0.813443072702332 r = 0.6308510638297873 f = 0.7106051527860996
Entity type: all p = 0.8976349521574292 r = 0.8050518134715026 f = 0.8488262910798122
speed: 38.83st/s; acc: 0.9744, p: 0.8976, r: 0.8051, f: 0.8488
Entity type: PER p = 0.5139146567717996 r = 0.5018115942028986 f = 0.5077910174152154
Entity type: LOC p = 0.6324582338902148 r = 0.5076628352490421 f = 0.563230605738576
Entity type: ORG p = 0.2732919254658385 r = 0.17813765182186234 f = 0.2156862745098039
Entity type: OTHER p = 0.18439716312056736 r = 0.11555555555555555 f = 0.14207650273224043
Entity type: all p = 0.4857142857142857 r = 0.3958602846054334 f = 0.436208125445474
Dev: time: 128.99s, speed: 38.62st/s; acc: 0.8845, p: 0.4857, r: 0.3959, f: 0.4362
Entity type: PER p = 0.5144124168514412 r = 0.5110132158590308 f = 0.512707182320442
Entity type: LOC p = 0.6137026239067055 r = 0.49616971125515613 f = 0.54871293580971
Entity type: ORG p = 0.2938775510204082 r = 0.17163289630512515 f = 0.21670428893905191
Entity type: OTHER p = 0.15777262180974477 r = 0.09366391184573003 f = 0.11754537597234226
Entity type: all p = 0.4837686111789114 r = 0.39031114612051987 f = 0.4320435967302453
Test: time: 84.74s, speed: 38.46st/s; acc: 0.8848, p: 0.4838, r: 0.3903, f: 0.4320

Index out of bounds

Traceback (most recent call last):
File "main.py", line 569, in <module>
train(data)
File "main.py", line 481, in train
speed, acc, p, r, f, _,_ = evaluate(data, model, "test")
File "main.py", line 171, in evaluate
tag_seq = model(batch_word, batch_features, batch_wordlen, batch_char, batch_charlen, batch_charrecover, mask, batch_obj, obj_mask)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/content/drive/MyDrive/Pytorch-implementation-for-OCSGA/model/MUL_LSTM_MCA.py", line 223, in forward
char_seq_recover, object_inputs, object_mask)
File "/content/drive/MyDrive/Pytorch-implementation-for-OCSGA/model/MUL_LSTM_MCA.py", line 126, in _get_lstm_features
text_feat = self.txt_feat_linear(self.bert(tmp).last_hidden_state)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 1027, in forward
return_dict=return_dict,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 613, in forward
output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 498, in forward
past_key_value=self_attn_past_key_value,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 430, in forward
output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 311, in forward
key_layer = self.transpose_for_scores(self.key(hidden_states))
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/linear.py", line 103, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Traceback (most recent call last):
File "main.py", line 569, in <module>
train(data)
File "main.py", line 481, in train
speed, acc, p, r, f, _,_ = evaluate(data, model, "test")
File "main.py", line 171, in evaluate
tag_seq = model(batch_word, batch_features, batch_wordlen, batch_char, batch_charlen, batch_charrecover, mask, batch_obj, obj_mask)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/content/drive/MyDrive/Pytorch-implementation-for-OCSGA/model/MUL_LSTM_MCA.py", line 228, in forward
outs = self._get_lstm_features(word_inputs, feature_inputs, word_seq_lengths, char_inputs, char_seq_lengths, char_seq_recover, object_inputs, object_mask)
File "/content/drive/MyDrive/Pytorch-implementation-for-OCSGA/model/MUL_LSTM_MCA.py", line 126, in _get_lstm_features
text_feat = self.txt_feat_linear(self.bert(tmp).last_hidden_state)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 1027, in forward
return_dict=return_dict,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 613, in forward
output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 498, in forward
past_key_value=self_attn_past_key_value,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 430, in forward
output_attentions,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/bert/modeling_bert.py", line 327, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
(screenshot of the error message)

The modified version:

(screenshot)

The original version:

(screenshot)