Analyzing and Optimizing the Ceres Library Internals Called by VINS-Mono
- 1. Back-end parameter optimization call flow in VINS-Mono
- 2. The internal solve flow of Ceres (to be continued)
First, my apologies that some of the image-algorithm code in my earlier posts was incomplete; I am still getting used to CSDN and left a few things out. If you are interested, feel free to add me on WeChat (yhtao923) and we can discuss.
This article does not cover the algorithmic theory behind VINS-Mono; there is plenty of material on that online. Instead, it analyzes the key code paths inside the Ceres library that VINS-Mono invokes, restructures the matrices involved, and uses platform vector instructions to obtain a substantial speedup. Note that this only affects the runtime performance of Ceres's nonlinear optimization, not the algorithm's behavior: the numerical results are unchanged. Resources on this topic are scarce, and most of them stop short of optimizing the functions inside the Ceres library itself. I read the library internals and reworked them around the structure of the SLAM data; the same ideas carry over to other nonlinear optimization workloads.
1. Back-end parameter optimization call flow in VINS-Mono
The back-end optimization code of VINS-Mono lives in the directory VINS-Mono\vins_estimator\src, in the file estimator.cpp, in the function optimization():
/* Add the inertial parameter blocks. */
vector2double();
//loss_function = new ceres::HuberLoss(1.0);
loss_function = new ceres::CauchyLoss(1.0);
for (int i = 0; i < WINDOW_SIZE + 1; i++)
{
    ceres::LocalParameterization *local_parameterization = new PoseLocalParameterization();
    problem.AddParameterBlock(para_Pose[i], SIZE_POSE, local_parameterization);
    problem.AddParameterBlock(para_SpeedBias[i], SIZE_SPEEDBIAS);
}
for (int i = 0; i < NUM_OF_CAM; i++)
{
    ceres::LocalParameterization *local_parameterization = new PoseLocalParameterization();
    problem.AddParameterBlock(para_Ex_Pose[i], SIZE_POSE, local_parameterization);
    if (!ESTIMATE_EXTRINSIC)
    {
        ROS_DEBUG("fix extinsic param");
        problem.SetParameterBlockConstant(para_Ex_Pose[i]);
    }
    else
    {
        ROS_DEBUG("estimate extinsic param");
    }
}
VINS-Mono defaults to one camera and a sliding window of 10: the macro NUM_OF_CAM is defined as 1 and WINDOW_SIZE as 10.
The parameters therefore cover 11 states of position, orientation, velocity, and accelerometer/gyroscope bias, i.e. (3 + 4 + 3 + 3 + 3) * 11 values in the blocks para_Pose and para_SpeedBias, plus one camera-extrinsic block para_Ex_Pose, for a total of 2 * 11 + 1 = 23 parameter blocks.
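Under these defaults the bookkeeping can be sanity-checked with a small standalone sketch; the constants mirror the VINS-Mono macros, but the function names are mine:

```cpp
#include <cassert>

// Parameter-block tally for the default VINS-Mono configuration:
// WINDOW_SIZE = 10 keyframes -> 11 states, NUM_OF_CAM = 1.
constexpr int kWindowSize = 10;
constexpr int kNumCameras = 1;
constexpr int kSizePose = 7;       // position (3) + quaternion (4)
constexpr int kSizeSpeedBias = 9;  // velocity (3) + accel bias (3) + gyro bias (3)

// One pose block and one speed/bias block per state, plus one
// extrinsic pose block per camera: 2 * 11 + 1 = 23 blocks.
int FixedParameterBlocks() {
  return 2 * (kWindowSize + 1) + kNumCameras;
}

// Total number of stored parameters in those fixed blocks.
int FixedParameterCount() {
  return (kWindowSize + 1) * (kSizePose + kSizeSpeedBias) +
         kNumCameras * kSizePose;
}
```

The visual para_Feature blocks (one inverse depth per landmark) come on top of this fixed set.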
The next step adds the inertial residual blocks to the Ceres problem.
/* Add the inertial (IMU pre-integration) residual blocks. */
for (int i = 0; i < WINDOW_SIZE; i++)
{
    int j = i + 1;
    if (pre_integrations[j]->sum_dt > 10.0)
        continue;
    IMUFactor* imu_factor = new IMUFactor(pre_integrations[j]);
    problem.AddResidualBlock(imu_factor, NULL, para_Pose[i], para_SpeedBias[i], para_Pose[j], para_SpeedBias[j]);
}
/* Add the visual (reprojection) residual blocks. */
int f_m_cnt = 0;
int feature_index = -1;
for (auto &it_per_id : f_manager.feature)
{
    it_per_id.used_num = it_per_id.feature_per_frame.size();
    if (!(it_per_id.used_num >= 2 && it_per_id.start_frame < WINDOW_SIZE - 2))
        continue;
    ++feature_index;
    int imu_i = it_per_id.start_frame, imu_j = imu_i - 1;
    Vector3d pts_i = it_per_id.feature_per_frame[0].point;
    for (auto &it_per_frame : it_per_id.feature_per_frame)
    {
        imu_j++;
        if (imu_i == imu_j)
        {
            continue;
        }
        Vector3d pts_j = it_per_frame.point;
        if (ESTIMATE_TD)
        {
            ProjectionTdFactor *f_td = new ProjectionTdFactor(pts_i, pts_j, it_per_id.feature_per_frame[0].velocity, it_per_frame.velocity,
                                                              it_per_id.feature_per_frame[0].cur_td, it_per_frame.cur_td,
                                                              it_per_id.feature_per_frame[0].uv.y(), it_per_frame.uv.y());
            problem.AddResidualBlock(f_td, loss_function, para_Pose[imu_i], para_Pose[imu_j], para_Ex_Pose[0], para_Feature[feature_index], para_Td[0]);
        }
        else
        {
            ProjectionFactor *f = new ProjectionFactor(pts_i, pts_j);
            problem.AddResidualBlock(f, loss_function, para_Pose[imu_i], para_Pose[imu_j], para_Ex_Pose[0], para_Feature[feature_index]);
        }
        f_m_cnt++;
    }
}
Next, configure and run the Ceres solver:
ceres::Solver::Options options;
options.linear_solver_type = ceres::DENSE_SCHUR;
options.trust_region_strategy_type = ceres::DOGLEG;
options.max_num_iterations = NUM_ITERATIONS;
if (marginalization_flag == MARGIN_OLD)
    options.max_solver_time_in_seconds = SOLVER_TIME * 4.0 / 5.0;
else
    options.max_solver_time_in_seconds = SOLVER_TIME;
TicToc t_solver;
ceres::Solver::Summary summary;
ceres::Solve(options, &problem, &summary);
The code above shows that the parameters of the nonlinear problem fall into two groups: a fixed set of inertial parameters (11 state groups plus the extrinsic), and the visual para_Feature parameters, whose count varies, typically from a few hundred up to a thousand or more.
The key call, ceres::Solve(options, &problem, &summary), is where most of the time is spent, so the next section analyzes the solve flow inside Ceres.
2. The internal solve flow of Ceres (to be continued)
The main entry point, ceres::Solve(options, &problem, &summary), is implemented in ceres-solver\internal\ceres\solver.cc.
Internally it breaks down into three key functions: preprocessing in preprocess(), optimization in minimize(), and post-processing in PostSolveSummarize().
A mind map of this structure follows (figure not reproduced here). Let us analyze these three functions one by one.
Key code in the preprocessing function preprocess()
Since VINS-Mono is configured to use the trust-region algorithm, the main entry point invoked is in the file trust_region_preprocessor.cc:
bool TrustRegionPreprocessor::Preprocess(const Solver::Options& options,
                                         ProblemImpl* problem,
                                         PreprocessedProblem* pp)
This preprocessing function initializes four pieces of functionality: SetupLinearSolver(), SetupEvaluator(), SetupInnerIterationMinimizer(), and SetupMinimizerOptions().
Apart from SetupLinearSolver(), the other three merely initialize some solver parameters and have little impact on performance, so we focus on SetupLinearSolver(). Note that it internally calls ReorderProgram(), whose comment reads:
// Reorder the program to reduce fill in and improve cache coherency
// of the Jacobian.
That is, the Jacobian structure is reordered to reduce fill-in and improve cache coherency, but in practice the effect goes further: reordering the Jacobian to match its structure makes the subsequent Schur elimination far more convenient. The ordering produced by the stock ReorderProgram() scrambles the Jacobian's original parameter order; my guess is that this is for the convenience of multi-threaded execution (I am not certain on this point).
ReorderProgram() is in the file trust_region_preprocessor.cc. Of the three functions in its dispatch branches, the options set earlier lead to ReorderProgramForSchurTypeLinearSolver(), whose source is in reorder_program.cc; the actual ordering function is ComputeStableSchurOrdering() in parameter_block_ordering.cc, which orders the parameter blocks using a graph structure similar to g2o's. We will not dwell on that algorithm here; for performance I replaced it with a simple ordering of my own.
Given the SLAM parameter structure described earlier, there are 23 fixed parameter blocks, while each visual parameter has residual blocks of dimension 2 and contributes a single scalar entry to the Hessian; the visual part therefore forms a diagonal sub-block of the Hessian, which the Schur elimination treats as one large block. So we put the 23 inertial and camera-extrinsic blocks into one group and the visual parameters into the other. The reordering code becomes:
vector<ParameterBlock*> schur_ordering;
#if 0
//const int size_of_first_elimination_group = \
//    ComputeStableSchurOrdering(*program, &schur_ordering);
#else
int size_of_first_elimination_group = 0;
const vector<ParameterBlock*>& parameter_blocks = program->parameter_blocks();
int param_blocks_size = parameter_blocks.size();
/* Visual (landmark) blocks come first: they form the elimination group. */
for (int i = 23; i < param_blocks_size; ++i)
{
    ParameterBlock* parameter_block = parameter_blocks[i];
    schur_ordering.push_back(parameter_block);
    size_of_first_elimination_group++;
}
/* The 23 inertial/extrinsic blocks follow. */
for (int i = 0; i < 23; ++i)
{
    ParameterBlock* parameter_block = parameter_blocks[i];
    schur_ordering.push_back(parameter_block);
}
#endif
This code goes into ReorderProgramForSchurTypeLinearSolver() in reorder_program.cc, replacing the call to ComputeStableSchurOrdering(). Note: this adjustment of the parameter order is the key modification of the optimization.
For the later Schur elimination and linear solve, the most important structure is the description of the sparse Jacobian. It is created in SetupMinimizerOptions() by the following call:
pp->minimizer_options.jacobian.reset(pp->evaluator->CreateJacobian());
CreateJacobian() is in the file block_jacobian_writer.cc. The sparse block Jacobian is described by the class CompressedRowBlockStructure, which holds a column structure and a row structure: the column entries correspond to the parameter blocks and the row entries to the residual blocks. The column setup is:
bs->cols.resize(parameter_blocks.size());
int cursor = 0;
for (int i = 0; i < parameter_blocks.size(); ++i) {
  CHECK_NE(parameter_blocks[i]->index(), -1);
  CHECK(!parameter_blocks[i]->IsConstant());
  bs->cols[i].size = parameter_blocks[i]->LocalSize();
  bs->cols[i].position = cursor;
  cursor += bs->cols[i].size;
}
Here bs->cols[i].size is the number of parameters in the i-th parameter block, i.e. the number of columns of that block's Jacobian sub-matrix, and bs->cols[i].position is the starting column of the i-th block.
The row structure is then initialized based on the column structure; the code follows, with analysis added in the comments.
// Construct the cells in each row.
const vector<ResidualBlock*>& residual_blocks = program_->residual_blocks();
int row_block_position = 0;
/* One row entry per residual block. */
bs->rows.resize(residual_blocks.size());
for (int i = 0; i < residual_blocks.size(); ++i) {
  const ResidualBlock* residual_block = residual_blocks[i];
  CompressedRow* row = &bs->rows[i];
  /* The size of each row block is the dimension of its residual. */
  row->block.size = residual_block->NumResiduals();
  /* The starting position of each row block. */
  row->block.position = row_block_position;
  row_block_position += row->block.size;
  // Size the row by the number of active parameters in this residual.
  /* Some parameter blocks are constant, i.e. not actually optimized. */
  const int num_parameter_blocks = residual_block->NumParameterBlocks();
  int num_active_parameter_blocks = 0;
  for (int j = 0; j < num_parameter_blocks; ++j) {
    if (residual_block->parameter_blocks()[j]->index() != -1) {
      num_active_parameter_blocks++;
    }
  }
  /* Number of cells in this row block; each cell is the Jacobian
     sub-matrix for one parameter block. */
  row->cells.resize(num_active_parameter_blocks);
  /* Initialize each cell. */
  // Add layout information for the active parameters in this row.
  for (int j = 0, k = 0; j < num_parameter_blocks; ++j) {
    const ParameterBlock* parameter_block =
        residual_block->parameter_blocks()[j];
    if (!parameter_block->IsConstant()) {
      Cell& cell = row->cells[k];
      /* Index of the parameter block this cell belongs to. */
      cell.block_id = parameter_block->index();
      /* position is the starting address of this sub-block in the memory
         holding the Jacobian; it is used heavily during the later Schur
         elimination. */
      cell.position = jacobian_layout_[i][k];
      // Only increment k for active parameters, since there is only layout
      // information for active parameters.
      k++;
    }
  }
  sort(row->cells.begin(), row->cells.end(), CellLessThan);
}
Analysis of the solve function Minimize()
The body of Minimize() is in trust_region_minimizer.cc; as noted above, the solver type configured by the project is the trust-region algorithm.
void TrustRegionMinimizer::Minimize(const Minimizer::Options& options,
                                    double* parameters,
                                    Solver::Summary* solver_summary) {
  start_time_in_secs_ = WallTimeInSeconds();
  iteration_start_time_in_secs_ = start_time_in_secs_;
  /* Initialize the minimizer state. */
  Init(options, parameters, solver_summary);
  /* The first evaluation of the Jacobian and residuals happens here. */
  RETURN_IF_ERROR_AND_LOG(IterationZero());
  // Create the TrustRegionStepEvaluator. The construction needs to be
  // delayed to this point because we need the cost for the starting
  // point to initialize the step evaluator.
  step_evaluator_.reset(new TrustRegionStepEvaluator(
      x_cost_,
      options_.use_nonmonotonic_steps
          ? options_.max_consecutive_nonmonotonic_steps
          : 0));
  iteration_start_time_in_secs_ = WallTimeInSeconds();
  while (FinalizeIterationAndCheckIfMinimizerCanContinue()) {
    const double previous_gradient_norm = iteration_summary_.gradient_norm;
    const double previous_gradient_max_norm =
        iteration_summary_.gradient_max_norm;
    iteration_summary_ = IterationSummary();
    iteration_summary_.iteration =
        solver_summary->iterations.back().iteration + 1;
    /* ComputeTrustRegionStep performs the Schur elimination, solves the
       linear system, and computes the delta update. It is the most
       important and most expensive part of the whole iteration, and the
       focus of our optimization. */
    RETURN_IF_ERROR_AND_LOG(ComputeTrustRegionStep());
    if (!iteration_summary_.step_is_valid) {
      RETURN_IF_ERROR_AND_LOG(HandleInvalidStep());
      continue;
    }
    if (options_.is_constrained &&
        options_.max_num_line_search_step_size_iterations > 0) {
      // Use a projected line search to enforce the bounds constraints
      // and improve the quality of the step.
      DoLineSearch(x_, gradient_, x_cost_, &delta_);
    }
    ComputeCandidatePointAndEvaluateCost();
    DoInnerIterationsIfNeeded();
    if (ParameterToleranceReached()) {
      return;
    }
    if (FunctionToleranceReached()) {
      return;
    }
    if (IsStepSuccessful()) {
      RETURN_IF_ERROR_AND_LOG(HandleSuccessfulStep());
    } else {
      // Declare the step unsuccessful and inform the trust region strategy.
      iteration_summary_.step_is_successful = false;
      iteration_summary_.cost = candidate_cost_ + solver_summary_->fixed_cost;
      // When the step is unsuccessful, we do not compute the gradient
      // (or update x), so we preserve its value from the last
      // successful iteration.
      iteration_summary_.gradient_norm = previous_gradient_norm;
      iteration_summary_.gradient_max_norm = previous_gradient_max_norm;
      strategy_->StepRejected(iteration_summary_.relative_decrease);
    }
  }
}
The structure of the sub-functions called under Minimize() is shown in the following diagram (figure not reproduced here).
We first look at IterationZero(), whose main call is EvaluateGradientAndJacobian(), which in turn calls:

evaluator_->Evaluate(evaluate_options,
                     x_.data(),
                     &x_cost_,
                     residuals_.data(),
                     gradient_.data(),
                     jacobian_);
Evaluate() is implemented in program_evaluator.h; it first calls PrepareForEvaluation() to lay out the memory of the sub-blocks of the Jacobian and the residuals.
The user-provided Jacobian and residual computations outside Ceres are then invoked from the following code:
// Evaluate the cost, residuals, and jacobians.
double block_cost;
if (!residual_block->Evaluate(
        evaluate_options.apply_loss_function,
        &block_cost,
        block_residuals,
        block_jacobians,
        scratch->residual_block_evaluate_scratch.get())) {
  abort = true;
  return;
}
This part accounts for only a small share of the total solve time, so we can set it aside for now. The most expensive step is the Schur elimination inside the linear solve, the Eliminate() function in schur_eliminator_impl.h; SolveReducedLinearSystem() and BackSubstitute() are also fairly costly. Here we focus on optimizing Eliminate(). The before/after timings are:
I0814 01:09:30.134727 127160 schur_complement_solver.cc:171] EliminateSlam 0.000228167
I0814 01:09:30.140404 127160 schur_complement_solver.cc:178] Eliminate 0.00563121
I0814 01:09:30.144488 127160 schur_complement_solver.cc:171] EliminateSlam 0.00022912
I0814 01:09:30.151427 127160 schur_complement_solver.cc:178] Eliminate 0.00400996
I0814 01:09:30.156728 127160 schur_complement_solver.cc:171] EliminateSlam 0.000275135
I0814 01:09:30.157441 127160 schur_complement_solver.cc:178] Eliminate 0.000660896
I0814 01:09:30.158865 127160 schur_complement_solver.cc:171] EliminateSlam 0.000221014
I0814 01:09:30.169198 127160 schur_complement_solver.cc:178] Eliminate 0.00068593
I0814 01:09:30.173861 127160 trust_region_minimizer.cc:136] cost 0.0395019
I0814 01:09:30.242223 127160 schur_complement_solver.cc:171] EliminateSlam 0.000225067
I0814 01:09:30.242919 127160 schur_complement_solver.cc:178] Eliminate 0.000641108
I0814 01:09:30.247565 127160 schur_complement_solver.cc:171] EliminateSlam 0.00022912
I0814 01:09:30.253784 127160 schur_complement_solver.cc:178] Eliminate 0.000674009
I0814 01:09:30.256924 127160 schur_complement_solver.cc:171] EliminateSlam 0.000255108
I0814 01:09:30.265875 127160 schur_complement_solver.cc:178] Eliminate 0.000684977
I0814 01:09:30.272472 127160 schur_complement_solver.cc:171] EliminateSlam 0.000262022
I0814 01:09:30.280575 127160 schur_complement_solver.cc:178] Eliminate 0.00079298
I0814 01:09:30.284862 127160 trust_region_minimizer.cc:136] cost 0.0431149
I0814 01:09:30.341929 127160 schur_complement_solver.cc:171] EliminateSlam 0.000477076
I0814 01:09:30.344331 127160 schur_complement_solver.cc:178] Eliminate 0.000635147
I0814 01:09:30.357460 127160 schur_complement_solver.cc:171] EliminateSlam 0.000243902
I0814 01:09:30.358187 127160 schur_complement_solver.cc:178] Eliminate 0.000662088
I0814 01:09:30.366591 127160 schur_complement_solver.cc:171] EliminateSlam 0.0018239
I0814 01:09:30.369652 127160 schur_complement_solver.cc:178] Eliminate 0.00064683
I0814 01:09:30.372679 127160 trust_region_minimizer.cc:136] cost 0.0314691
I0814 01:09:30.436864 127160 schur_complement_solver.cc:171] EliminateSlam 0.000241041
I0814 01:09:30.438820 127160 schur_complement_solver.cc:178] Eliminate 0.00144482
I0814 01:09:30.441287 127160 schur_complement_solver.cc:171] EliminateSlam 0.000237942
I0814 01:09:30.446676 127160 schur_complement_solver.cc:178] Eliminate 0.00332403
I0814 01:09:30.456245 127160 schur_complement_solver.cc:171] EliminateSlam 0.00421715
I0814 01:09:30.456990 127160 schur_complement_solver.cc:178] Eliminate 0.00067687
I0814 01:09:30.460294 127160 schur_complement_solver.cc:171] EliminateSlam 0.000252008
I0814 01:09:30.468626 127160 schur_complement_solver.cc:178] Eliminate 0.000689983
I0814 01:09:30.472404 127160 trust_region_minimizer.cc:136] cost 0.0359242
I0814 01:09:30.530846 127160 schur_complement_solver.cc:171] EliminateSlam 0.000265121
I0814 01:09:30.532482 127160 schur_complement_solver.cc:178] Eliminate 0.000962019
I0814 01:09:30.542966 127160 schur_complement_solver.cc:171] EliminateSlam 0.000476122
I0814 01:09:30.547399 127160 schur_complement_solver.cc:178] Eliminate 0.000714064
I0814 01:09:30.553222 127160 schur_complement_solver.cc:171] EliminateSlam 0.000302076
I0814 01:09:30.563824 127160 schur_complement_solver.cc:178] Eliminate 0.000738144
I0814 01:09:30.566916 127160 trust_region_minimizer.cc:136] cost 0.0364859
I0814 01:09:30.624837 127160 schur_complement_solver.cc:171] EliminateSlam 0.000313997
I0814 01:09:30.628013 127160 schur_complement_solver.cc:178] Eliminate 0.00311995
I0814 01:09:30.636418 127160 schur_complement_solver.cc:171] EliminateSlam 0.000257015
I0814 01:09:30.639873 127160 schur_complement_solver.cc:178] Eliminate 0.00140882
I0814 01:09:30.641511 127160 schur_complement_solver.cc:171] EliminateSlam 0.000247955
I0814 01:09:30.654429 127160 schur_complement_solver.cc:178] Eliminate 0.00766611
I0814 01:09:30.659863 127160 trust_region_minimizer.cc:136] cost 0.0354891
I0814 01:09:30.717357 127160 schur_complement_solver.cc:171] EliminateSlam 0.00029397
I0814 01:09:30.720774 127160 schur_complement_solver.cc:178] Eliminate 0.00069499
I0814 01:09:30.726763 127160 schur_complement_solver.cc:171] EliminateSlam 0.000328064
I0814 01:09:30.730413 127160 schur_complement_solver.cc:178] Eliminate 0.00172615
I0814 01:09:30.737879 127160 schur_complement_solver.cc:171] EliminateSlam 0.000255108
I0814 01:09:30.740783 127160 schur_complement_solver.cc:178] Eliminate 0.000646114
In the log, EliminateSlam is the version optimized for this SLAM problem and Eliminate is the stock Ceres code; times are in seconds. The optimized version is several times faster across the board. The test platform is an x64 Intel® Core™ i7-9700 CPU @ 3.00GHz, running in a single-core Ubuntu virtual machine with 4 GB of RAM.
Following the comments in the Ceres source, let us first review the basic algorithm of Schur elimination.
Classes implementing the SchurEliminatorBase interface perform variable elimination for the linear least-squares problem

    A x = b

where the variables are partitioned as x = [y; z] and the columns of A as A = [E F], such that E'E is block diagonal; equivalently, the parameter blocks in E form an independent set in the graph implied by the block matrix A'A. The eliminator then computes the Schur complement: forming the normal equations

    [E'E  E'F] [y]   [E'b]
    [F'E  F'F] [z] = [F'b]

multiplying the first block row by F'E(E'E)^{-1} and subtracting it from the second eliminates y, leaving the reduced linear system

    (F'F - F'E(E'E)^{-1}E'F) z = F'b - F'E(E'E)^{-1}E'b.

This is the elimination step. The eliminator also provides the back-substitution: once z is known, y is recovered by solving

    y = (E'E)^{-1}(E'b - E'F z).

The rows of A must be reordered so that E is block diagonal in this sense, with F holding the remaining columns; for simplicity, the description above takes the regularizer D to be empty. Rather than forming the normal equations of a general sparse matrix and operating on them, the algorithm processes one parameter block of y at a time: the set of rows corresponding to a single parameter block y_i is called a chunk, and the algorithm handles one chunk per step. The reduced linear system can be expressed as the sum of per-chunk contributions, which follows from two observations:
1. E'E is block diagonal.
2. When computing E'F, only terms within a single chunk interact, i.e. the transpose-and-multiply for the column block of y_i involves only that chunk's rows.
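In this SLAM problem the eliminated e-blocks are the 1-D inverse depths, so E'E is diagonal with scalar entries and its inverse is an element-wise reciprocal. A toy scalar instance of the elimination formulas (illustrative code, not from Ceres), with one 1-D y variable observed by two residual rows:

```cpp
#include <cassert>

// Reduced system produced by eliminating y from A x = b with
// A = [e f], x = [y; z], where e and f are column 2-vectors:
//   lhs = f'f - f'e (e'e)^{-1} e'f
//   rhs = f'b - f'e (e'e)^{-1} e'b
struct Reduced {
  double lhs;
  double rhs;
};

Reduced EliminateScalarEBlock(const double e[2], const double f[2],
                              const double b[2]) {
  const double ete = e[0] * e[0] + e[1] * e[1];  // e'e, a 1x1 block
  const double etf = e[0] * f[0] + e[1] * f[1];  // e'f
  const double etb = e[0] * b[0] + e[1] * b[1];  // e'b
  const double inv = 1.0 / ete;                  // the inverse is a reciprocal
  Reduced r;
  r.lhs = f[0] * f[0] + f[1] * f[1] - etf * inv * etf;
  r.rhs = f[0] * b[0] + f[1] * b[1] - etf * inv * etb;
  return r;
}
```

Solving r.lhs * z = r.rhs and back-substituting y = inv * (etb - etf * z) reproduces the solution of the full 2x2 system.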
To keep the results identical before and after the optimization, we write a new function alongside Eliminate() in schur_eliminator_impl.h:
template <int kRowBlockSize, int kEBlockSize, int kFBlockSize>
void SchurEliminator<kRowBlockSize, kEBlockSize, kFBlockSize>::EliminateSlam(
    const BlockSparseMatrixData& A,
    const double* b,
    const double* D,
    BlockRandomAccessMatrix* lhs,
    double* rhs)
Note that the corresponding declaration must be added to the relevant headers, e.g. schur_eliminator.h.
Before Eliminate() is called, an initialization step must run: the Init() function in schur_eliminator_impl.h, invoked from SolveImpl() in schur_complement_solver.cc as follows:
eliminator_->Init(num_eliminate_blocks, kFullRankETE, bs);
Init() sets up two structures, lhs_row_layout_ and chunks_. lhs_row_layout_ records the starting index of each Hessian block of the reduced (Schur complement) matrix assembled from the Jacobian sub-blocks. chunks_ is a vector whose elements each contain three members:
int size;
int start;
BufferLayoutType buffer_layout;
Here size is the number of residual row blocks associated with the parameter block, start is the index of the first of those rows, and buffer_layout describes the layout of the small sub-blocks inside the temporary buffer that holds the E'F matrix.
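A chunk is just a run of consecutive rows that share the same leading e-block, and the earlier reordering guarantees such rows are contiguous. A simplified sketch of how start and size could be derived (illustrative, not the actual Init() code):

```cpp
#include <cassert>
#include <vector>

struct Chunk {
  int start;  // first row of this chunk
  int size;   // number of rows sharing the same e-block
};

// Group rows by the id of their leading e-block. Rows must already be
// ordered so that rows touching the same e-block are contiguous.
std::vector<Chunk> BuildChunks(const std::vector<int>& row_e_block_ids) {
  std::vector<Chunk> chunks;
  for (int r = 0; r < (int)row_e_block_ids.size(); ++r) {
    if (chunks.empty() ||
        row_e_block_ids[chunks.back().start] != row_e_block_ids[r]) {
      chunks.push_back({r, 1});  // new e-block begins a new chunk
    } else {
      ++chunks.back().size;      // same e-block, extend current chunk
    }
  }
  return chunks;
}
```

For example, rows with e-block ids {0, 0, 1, 1, 1, 2} produce three chunks with (start, size) = (0, 2), (2, 3), (5, 1).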
Let us now focus on the core of the optimization, the Eliminate() function.
The first part adds the lambda damping to the diagonal elements of the reduced matrix, which the LM algorithm uses:
// Add the diagonal to the schur complement.
if (D != NULL) {
  ParallelFor(context_,
              num_eliminate_blocks_,
              num_col_blocks,
              num_threads_,
              [&](int i) {
                const int block_id = i - num_eliminate_blocks_;
                int r, c, row_stride, col_stride;
                CellInfo* cell_info = lhs->GetCell(
                    block_id, block_id, &r, &c, &row_stride, &col_stride);
                if (cell_info != NULL) {
                  const int block_size = bs->cols[i].size;
                  typename EigenTypes<Eigen::Dynamic>::ConstVectorRef diag(
                      D + bs->cols[i].position, block_size);
                  std::lock_guard<std::mutex> l(cell_info->m);
                  MatrixRef m(cell_info->values, row_stride, col_stride);
                  m.block(r, c, block_size, block_size).diagonal() +=
                      diag.array().square().matrix();
                }
              });
}
This part is cheap, so we skip it for now.
To reduce function-call overhead inside the loop, we precompute a block layout:
int num_rows_ = 0;
const int num_blocks = blocks.size();
block_layout_.resize(num_blocks, 0);
for (int i = 0; i < 23; ++i)
{
    block_layout_[i] = num_rows_;
    num_rows_ += blocks[i];
}
which replaces the calls to lhs->GetCell().
The main optimization target is the loop body below:
ParallelFor(
    context_,
    0,
    int(chunks_.size()),
    num_threads_,
    [&](int thread_id, int i) {
      double* buffer = buffer_.get() + thread_id * buffer_size_;
      const Chunk& chunk = chunks_[i];
      const int e_block_id = bs->rows[chunk.start].cells.front().block_id;
      const int e_block_size = bs->cols[e_block_id].size;
      VectorRef(buffer, buffer_size_).setZero();
      typename EigenTypes<kEBlockSize, kEBlockSize>::Matrix ete(e_block_size,
                                                                e_block_size);
      if (D != NULL) {
        const typename EigenTypes<kEBlockSize>::ConstVectorRef diag(
            D + bs->cols[e_block_id].position, e_block_size);
        ete = diag.array().square().matrix().asDiagonal();
      } else {
        ete.setZero();
      }
      FixedArray<double, 8> g(e_block_size);
      typename EigenTypes<kEBlockSize>::VectorRef gref(g.data(),
                                                       e_block_size);
      gref.setZero();
      // We are going to be computing
      //
      //   S += F'F - F'E(E'E)^{-1}E'F
      //
      // for each Chunk. The computation is broken down into a number of
      // function calls as below.
      // Compute the outer product of the e_blocks with themselves (ete
      // = E'E). Compute the product of the e_blocks with the
      // corresponding f_blocks (buffer = E'F), the gradient of the terms
      // in this chunk (g) and add the outer product of the f_blocks to
      // Schur complement (S += F'F).
      ChunkDiagonalBlockAndGradient(
          chunk, A, b, chunk.start, &ete, g.data(), buffer, lhs);
      // Normally one wouldn't compute the inverse explicitly, but
      // e_block_size will typically be a small number like 3, in
      // which case its much faster to compute the inverse once and
      // use it to multiply other matrices/vectors instead of doing a
      // Solve call over and over again.
      typename EigenTypes<kEBlockSize, kEBlockSize>::Matrix inverse_ete =
          InvertPSDMatrix<kEBlockSize>(assume_full_rank_ete_, ete);
      // For the current chunk compute and update the rhs of the reduced
      // linear system.
      //
      //   rhs = F'b - F'E(E'E)^(-1) E'b
      if (rhs) {
        FixedArray<double, 8> inverse_ete_g(e_block_size);
        MatrixVectorMultiply<kEBlockSize, kEBlockSize, 0>(
            inverse_ete.data(),
            e_block_size,
            e_block_size,
            g.data(),
            inverse_ete_g.data());
        UpdateRhs(chunk, A, b, chunk.start, inverse_ete_g.data(), rhs);
      }
      // S -= F'E(E'E)^{-1}E'F
      ChunkOuterProduct(
          thread_id, bs, inverse_ete, buffer, chunk.buffer_layout, lhs);
    });
// For rows with no e_blocks, the schur complement update reduces to
//   S += F'F.
NoEBlockRowsUpdate(A, b, uneliminated_row_begins_, lhs, rhs);
}
The loop runs chunks_.size() times, once per eliminated parameter block. Each iteration takes all the Jacobian sub-blocks belonging to that parameter block and computes its contribution to lhs and the corresponding part of rhs. The loop body consists mainly of three functions: ChunkDiagonalBlockAndGradient(), UpdateRhs(), and ChunkOuterProduct().
ChunkDiagonalBlockAndGradient() does two things: it accumulates the visual rows' contribution to F'F, and it computes the E'F matrix, stored in the temporary buffer. The code is:
// Given a Chunk - set of rows with the same e_block, e.g. in the
// following Chunk with two rows.
//
//                 E                   F
//      [ y11   0   0   0 | z11     0    0   0 z51]
//      [ y12   0   0   0 | z12   z22    0   0   0]
//
// this function computes two matrices. The diagonal block matrix
//
//   ete = y11 * y11' + y12 * y12'
//
// and the off diagonal blocks in the Gauss Newton Hessian.
//
//   buffer = [y11'(z11 + z12), y12' * z22, y11' * z51]
//
// which are zero compressed versions of the block sparse matrices E'E
// and E'F.
//
// and the gradient of the e_block, E'b.
template <int kRowBlockSize, int kEBlockSize, int kFBlockSize>
void SchurEliminator<kRowBlockSize, kEBlockSize, kFBlockSize>::
    ChunkDiagonalBlockAndGradient(
        const Chunk& chunk,
        const BlockSparseMatrixData& A,
        const double* b,
        int row_block_counter,
        typename EigenTypes<kEBlockSize, kEBlockSize>::Matrix* ete,
        double* g,
        double* buffer,
        BlockRandomAccessMatrix* lhs) {
  const CompressedRowBlockStructure* bs = A.block_structure();
  const double* values = A.values();
  int b_pos = bs->rows[row_block_counter].block.position;
  const int e_block_size = ete->rows();
  // Iterate over the rows in this chunk, for each row, compute the
  // contribution of its F blocks to the Schur complement, the
  // contribution of its E block to the matrix EE' (ete), and the
  // corresponding block in the gradient vector.
  for (int j = 0; j < chunk.size; ++j) {
    const CompressedRow& row = bs->rows[row_block_counter + j];
    if (row.cells.size() > 1) {
      EBlockRowOuterProduct(A, row_block_counter + j, lhs);
    }
    // Extract the e_block, ETE += E_i' E_i
    const Cell& e_cell = row.cells.front();
    // clang-format off
    MatrixTransposeMatrixMultiply
        <kRowBlockSize, kEBlockSize, kRowBlockSize, kEBlockSize, 1>(
        values + e_cell.position, row.block.size, e_block_size,
        values + e_cell.position, row.block.size, e_block_size,
        ete->data(), 0, 0, e_block_size, e_block_size);
    // clang-format on
    if (b) {
      // g += E_i' b_i
      // clang-format off
      MatrixTransposeVectorMultiply<kRowBlockSize, kEBlockSize, 1>(
          values + e_cell.position, row.block.size, e_block_size,
          b + b_pos,
          g);
      // clang-format on
    }
    // buffer = E'F. This computation is done by iterating over the
    // f_blocks for each row in the chunk.
    for (int c = 1; c < row.cells.size(); ++c) {
      const int f_block_id = row.cells[c].block_id;
      const int f_block_size = bs->cols[f_block_id].size;
      double* buffer_ptr = buffer + FindOrDie(chunk.buffer_layout, f_block_id);
      // clang-format off
      MatrixTransposeMatrixMultiply
          <kRowBlockSize, kEBlockSize, kRowBlockSize, kFBlockSize, 1>(
          values + e_cell.position, row.block.size, e_block_size,
          values + row.cells[c].position, row.block.size, f_block_size,
          buffer_ptr, 0, 0, e_block_size, f_block_size);
      // clang-format on
    }
    b_pos += row.block.size;
  }
}
To make this part easier to optimize, and to ease future porting to embedded platforms, we split the work up. The first part, with the inner macro calls expanded and rewritten with AVX intrinsics, becomes:
for (int i = 0; i < chunks_.size(); i++)
{
    const Chunk& chunk = chunks_[i];
    int r, c, row_stride, col_stride;
    CellInfo* cell_info =
        lhs->GetCell(0, 0, &r, &c, &row_stride, &col_stride);
    double *lhs_val = cell_info->values;
    for (int j = 0; j < chunk.size; ++j)
    {
        const CompressedRow& row = bs->rows[chunk.start + j];
        //EBlockRowOuterProduct(A, chunk.start + j, lhs);
        int block1 = row.cells[1].block_id - num_eliminate_blocks_;
        int block1_size = bs->cols[row.cells[1].block_id].size;
        r = block_layout_[block1];
        c = block_layout_[block1];
        double *val_ptr = (double *)(values + row.cells[1].position);
        __m256i mask = _mm256_setr_epi32(-20, -20, -48, -9, 20, 20, 48, 9);
        __m256d val_b01 = _mm256_set_pd(0, 0, 0, 0);
        __m256d val_b11 = _mm256_set_pd(0, 0, 0, 0);
        for (int k = 0; k < block1_size; k++)
        {
            double *lhs_val_ptr = &lhs_val[(k + r) * col_stride + 0 + c];
            double a_val0 = val_ptr[0 * block1_size + k];
            double a_val1 = val_ptr[1 * block1_size + k];
            __m256d v_sum0 = _mm256_loadu_pd(&lhs_val[(k + r) * col_stride + c]);
            __m256d v_sum1 = _mm256_maskload_pd(&lhs_val[(k + r) * col_stride + c + 4], mask);
            __m256d v_val0 = _mm256_set_pd(a_val0, a_val0, a_val0, a_val0);
            __m256d v_val1 = _mm256_set_pd(a_val1, a_val1, a_val1, a_val1);
            __m256d val_b00 = _mm256_loadu_pd(&val_ptr[0]);
            __m256d val_b10 = _mm256_loadu_pd(&val_ptr[block1_size]);
            val_b01 = _mm256_maskload_pd(&val_ptr[4], mask);
            val_b11 = _mm256_maskload_pd(&val_ptr[block1_size + 4], mask);
            v_sum0 = _mm256_fmadd_pd(v_val0, val_b00, v_sum0);
            v_sum0 = _mm256_fmadd_pd(v_val1, val_b10, v_sum0);
            v_sum1 = _mm256_fmadd_pd(v_val0, val_b01, v_sum1);
            v_sum1 = _mm256_fmadd_pd(v_val1, val_b11, v_sum1);
            double *v_sum0_ptr = (double *)&v_sum0;
            double *v_sum1_ptr = (double *)&v_sum1;
            _mm256_storeu_pd(lhs_val_ptr, v_sum0);
            lhs_val_ptr[4] = v_sum1_ptr[0];
            lhs_val_ptr[5] = v_sum1_ptr[1];
        }
        const int block2 = row.cells[2].block_id - num_eliminate_blocks_;
        const int block2_size = bs->cols[row.cells[2].block_id].size;
        //CellInfo* cell_info =
        //    lhs->GetCell(block1, block2, &r, &c, &row_stride, &col_stride);
        r = block_layout_[block1];
        c = block_layout_[block2];
        double *val_ptr0 = (double *)(values + row.cells[1].position);
        double *val_ptr1 = (double *)(values + row.cells[2].position);
        for (int k = 0; k < block1_size; k++)
        {
            double *lhs_val_ptr = &lhs_val[(k + r) * col_stride + 0 + c];
            double a_val0 = val_ptr0[0 * block1_size + k];
            double a_val1 = val_ptr0[1 * block1_size + k];
            __m256d v_sum0 = _mm256_loadu_pd(&lhs_val[(k + r) * col_stride + c]);
            __m256d v_sum1 = _mm256_maskload_pd(&lhs_val[(k + r) * col_stride + c + 4], mask);
            __m256d v_val0 = _mm256_set_pd(a_val0, a_val0, a_val0, a_val0);
            __m256d v_val1 = _mm256_set_pd(a_val1, a_val1, a_val1, a_val1);
            __m256d val_b00 = _mm256_loadu_pd(&val_ptr1[0]);
            __m256d val_b10 = _mm256_loadu_pd(&val_ptr1[block1_size]);
            val_b01 = _mm256_maskload_pd(&val_ptr1[4], mask);
            val_b11 = _mm256_maskload_pd(&val_ptr1[block1_size + 4], mask);
            v_sum0 = _mm256_fmadd_pd(v_val0, val_b00, v_sum0);
            v_sum0 = _mm256_fmadd_pd(v_val1, val_b10, v_sum0);
            v_sum1 = _mm256_fmadd_pd(v_val0, val_b01, v_sum1);
            v_sum1 = _mm256_fmadd_pd(v_val1, val_b11, v_sum1);
            double *v_sum0_ptr = (double *)&v_sum0;
            double *v_sum1_ptr = (double *)&v_sum1;
            _mm256_storeu_pd(lhs_val_ptr, v_sum0);
            lhs_val_ptr[4] = v_sum1_ptr[0];
            lhs_val_ptr[5] = v_sum1_ptr[1];
        }
        block1 = row.cells[2].block_id - num_eliminate_blocks_;
        block1_size = bs->cols[row.cells[2].block_id].size;
        r = block_layout_[block1];
        c = block_layout_[block1];
        val_ptr = (double *)(values + row.cells[2].position);
        for (int k = 0; k < block1_size; k++)
        {
            double *lhs_val_ptr = &lhs_val[(k + r) * col_stride + 0 + c];
            double a_val0 = val_ptr[0 * block1_size + k];
            double a_val1 = val_ptr[1 * block1_size + k];
            __m256d v_sum0 = _mm256_loadu_pd(&lhs_val[(k + r) * col_stride + c]);
            __m256d v_sum1 = _mm256_maskload_pd(&lhs_val[(k + r) * col_stride + c + 4], mask);
            __m256d v_val0 = _mm256_set_pd(a_val0, a_val0, a_val0, a_val0);
            __m256d v_val1 = _mm256_set_pd(a_val1, a_val1, a_val1, a_val1);
            __m256d val_b00 = _mm256_loadu_pd(&val_ptr[0]);
            __m256d val_b10 = _mm256_loadu_pd(&val_ptr[block1_size]);
            val_b01 = _mm256_maskload_pd(&val_ptr[4], mask);
            val_b11 = _mm256_maskload_pd(&val_ptr[block1_size + 4], mask);
            v_sum0 = _mm256_fmadd_pd(v_val0, val_b00, v_sum0);
            v_sum0 = _mm256_fmadd_pd(v_val1, val_b10, v_sum0);
            v_sum1 = _mm256_fmadd_pd(v_val0, val_b01, v_sum1);
            v_sum1 = _mm256_fmadd_pd(v_val1, val_b11, v_sum1);
            double *v_sum0_ptr = (double *)&v_sum0;
            double *v_sum1_ptr = (double *)&v_sum1;
            _mm256_storeu_pd(lhs_val_ptr, v_sum0);
            lhs_val_ptr[4] = v_sum1_ptr[0];
            lhs_val_ptr[5] = v_sum1_ptr[1];
        }
    }
}
Some of the loops here have fixed trip counts (1 or 6), so they are unrolled and computed with AVX instructions to improve efficiency.
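For reference, what the intrinsics above compute is the accumulation of the 6x6 outer product J'J of one 2x6 Jacobian block (2 residual rows, pose tangent size 6) into the dense lhs matrix; the AVX version simply fixes those dimensions and unrolls. A plain scalar sketch of the same kernel (names are mine):

```cpp
#include <cassert>

// Scalar reference for the unrolled AVX kernel: accumulate the 6x6
// outer product J' J of one row-major 2x6 Jacobian block into a
// row-major destination with leading dimension `stride`.
void AccumulateJtJ2x6(const double J[12], double* dst, int stride) {
  const int kRows = 2;  // residual dimension of a projection factor
  const int kCols = 6;  // tangent size of a pose block
  for (int r = 0; r < kCols; ++r)
    for (int c = 0; c < kCols; ++c)
      for (int k = 0; k < kRows; ++k)
        dst[r * stride + c] += J[k * kCols + r] * J[k * kCols + c];
}
```

The AVX code processes each output row as one 4-wide load/FMA/store plus a masked 2-wide tail, which is exactly the c-loop above split into 4 + 2 lanes.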
Inside the big loop, each parameter block needs e'e and its inverse; since the e sub-block here is 1-dimensional, the inverse is simply the reciprocal, and each group's values are stored in a temporary buffer. The code is:
const CompressedRowBlockStructure* bs = A.block_structure();
const double* values = A.values();
int b_pos = bs->rows[chunk.start].block.position;
// Iterate over the rows in this chunk, for each row, compute the
// contribution of its F blocks to the Schur complement, the
// contribution of its E block to the matrix EE' (ete), and the
// corresponding block in the gradient vector.
double mask_buf0[4] = {-1, -2, -3, -4};
double mask_buf1[4] = {1, 2, 3, 4};
int mask_num = (chunk.size & 0x3);
__m256d mask0 = _mm256_loadu_pd(mask_buf0);
__m256d mask1 = _mm256_loadu_pd(mask_buf1);
for (int j = 0; j < mask_num; j++)
{
    mask_buf0[j] = mask_buf1[j];
}
__m256d mask = _mm256_loadu_pd(mask_buf0);
__m256d zero = _mm256_set1_pd(0.0);
double ete_t = ete, g_t = g;
double zero_t = 0.0;
__m256d vSum0 = _mm256_broadcast_sd(&zero_t);
__m256d vSum1 = _mm256_broadcast_sd(&zero_t);
__m256d vgSum0 = _mm256_broadcast_sd(&zero_t);
__m256d vgSum1 = _mm256_broadcast_sd(&zero_t);
__m256i vindex0, vindex1;
vindex0 = _mm256_set_epi64x(24, 16, 8, 0);
vindex1 = _mm256_set_epi64x(28, 20, 12, 4);
double ete_tt = ete_t;
double g_tt = 0;
for (int j = 0; j < chunk.size; j += 4)
{
    // clang-format off
    /*
    MatrixTransposeMatrixMultiply
        <kRowBlockSize, kEBlockSize, kRowBlockSize, kEBlockSize, 1>(
        values + e_cell.position, row.block.size, e_block_size,
        values + e_cell.position, row.block.size, e_block_size,
        ete.data(), 0, 0, e_block_size, e_block_size);
    */
    double *val_ptr = (double *)(values + idx * 2);
    int num = 4;
    if ((j + 4) > chunk.size)
    {
        num = chunk.size & 0x3;
        mask1 = mask;
    }
    __m256d v_val0 = _mm256_i64gather_pd(val_ptr, vindex0, 2);
    __m256d v_val1 = _mm256_i64gather_pd(val_ptr, vindex1, 2);
    v_val0 = _mm256_blendv_pd(v_val0, zero, mask1);
    v_val1 = _mm256_blendv_pd(v_val1, zero, mask1);
    vSum0 = _mm256_fmadd_pd(v_val0, v_val0, vSum0);
    vSum1 = _mm256_fmadd_pd(v_val1, v_val1, vSum1);
    // clang-format on
    //const CompressedRow& row = bs->rows[chunk.start + j];
    // Extract the e_block, ETE += E_i' E_i
    //const Cell& e_cell = row.cells.front();
    // g += E_i' b_i
    double *b_ptr = (double*)(b + idx * 2);
    __m256d v_val_b_0 = _mm256_i64gather_pd(b_ptr, vindex0, 2);
    __m256d v_val_b_1 = _mm256_i64gather_pd(b_ptr, vindex1, 2);
    vgSum0 = _mm256_fmadd_pd(v_val0, v_val_b_0, vgSum0);
    vgSum1 = _mm256_fmadd_pd(v_val1, v_val_b_1, vgSum1);
    idx += num;
}
ete_tt += vSum0[0] + vSum0[1] + vSum0[2] + vSum0[3];
ete_tt += vSum1[0] + vSum1[1] + vSum1[2] + vSum1[3];
g_tt += vgSum0[0] + vgSum0[1] + vgSum0[2] + vgSum0[3];
g_tt += vgSum1[0] + vgSum1[1] + vgSum1[2] + vgSum1[3];
ete = ete_tt;
g = g_tt;
// Normally one wouldn't compute the inverse explicitly, but
// e_block_size will typically be a small number like 3, in
// which case its much faster to compute the inverse once and
// use it to multiply other matrices/vectors instead of doing a
// Solve call over and over again.
double inverse_ete = (double)1 / ete;
// For the current chunk compute and update the rhs of the reduced
// linear system.
//
//   rhs = F'b - F'E(E'E)^(-1) E'b
double inverse_ete_g = inverse_ete * g;
//UpdateRhs(chunk, A, b, chunk.start, inverse_ete_g.data(), rhs);
ete_inv[i * e_block_size * e_block_size] = inverse_ete;
ete_inv_g[i * e_block_size] = inverse_ete_g;
}
UpdateRhs() computes the following:

    F'b - F'E(E'E)^(-1) E'b

Its internal code is:
// Update the rhs of the reduced linear system. Compute
//
// F'b - F'E(E'E)^(-1) E'b
template <int kRowBlockSize, int kEBlockSize, int kFBlockSize>
void SchurEliminator<kRowBlockSize, kEBlockSize, kFBlockSize>::UpdateRhs(
const Chunk& chunk,
const BlockSparseMatrixData& A,
const double* b,
int row_block_counter,
const double* inverse_ete_g,
double* rhs) {
const CompressedRowBlockStructure* bs = A.block_structure();
const double* values = A.values();
const int e_block_id = bs->rows[chunk.start].cells.front().block_id;
const int e_block_size = bs->cols[e_block_id].size;
//LOG(INFO) << "UpdateRhs " << e_block_size;
int b_pos = bs->rows[row_block_counter].block.position;
for (int j = 0; j < chunk.size; ++j) {
const CompressedRow& row = bs->rows[row_block_counter + j];
const Cell& e_cell = row.cells.front();
typename EigenTypes<kRowBlockSize>::Vector sj =
typename EigenTypes<kRowBlockSize>::ConstVectorRef(b + b_pos,
row.block.size);
// clang-format off
MatrixVectorMultiply<kRowBlockSize, kEBlockSize, -1>(
values + e_cell.position, row.block.size, e_block_size,
inverse_ete_g, sj.data());
// clang-format on
for (int c = 1; c < row.cells.size(); ++c) {
const int block_id = row.cells[c].block_id;
const int block_size = bs->cols[block_id].size;
const int block = block_id - num_eliminate_blocks_;
std::lock_guard<std::mutex> l(*rhs_locks_[block]);
// clang-format off
MatrixTransposeVectorMultiply<kRowBlockSize, kFBlockSize, 1>(
values + row.cells[c].position,
row.block.size, block_size,
sj.data(), rhs + lhs_row_layout_[block]);
// clang-format on
}
b_pos += row.block.size;
}
}
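For the VINS shapes (e_block_size = 1, two-row residual blocks, 6-wide pose blocks), the math that UpdateRhs performs for one chunk row and one f-block can be sketched in plain C++. This is a minimal scalar reference with illustrative names, not ceres' own signatures:

```cpp
#include <cassert>
#include <cmath>

// Scalar reference for one chunk row of UpdateRhs:
// s = b - E * (E'E)^{-1} E'b, then rhs += F' * s.
// E is a 2x1 e-block, F a 2x6 f-block, b the 2-vector residual slice.
void update_rhs_scalar(const double E[2], const double F[2][6],
                       const double b[2], double inverse_ete_g,
                       double rhs[6]) {
  double s[2];
  for (int r = 0; r < 2; ++r)
    s[r] = b[r] - E[r] * inverse_ete_g;  // s = b - E * ete^{-1} g
  for (int c = 0; c < 6; ++c)
    for (int r = 0; r < 2; ++r)
      rhs[c] += F[r][c] * s[r];          // rhs += F' s
}
```

The AVX rewrite below performs exactly this update, broadcasting s[0] and s[1] and fusing the two column sweeps into vector FMAs.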
Because the Jacobian block dimensions are fixed and known in advance, the inner loops here can likewise be flattened and rewritten with AVX instructions:
int b_pos = 0;
for(int i = 0;i < chunks_.size();i++)
{
const Chunk& chunk = chunks_[i];
FixedArray<double, 8> inverse_ete_g(1);
inverse_ete_g.data()[0] = ete_inv_g[i];
double inverse_ete_eb = ete_inv_g[i];
double zero_t = 0.0;
__m256d v_zero = _mm256_broadcast_sd(&zero_t);
__m256d mask = _mm256_set_pd(-1,-1,2,2);
__m256i maski64 = _mm256_set_epi64x(-1,-1,2,2);
for (int j = 0; j < chunk.size; ++j)
{
const CompressedRow& row = bs->rows[chunk.start + j];
typename EigenTypes<kRowBlockSize>::Vector sj =
typename EigenTypes<kRowBlockSize>::ConstVectorRef(b + b_pos,2);
// clang-format off
//MatrixVectorMultiply<kRowBlockSize, kEBlockSize, -1>(
// values + e_cell.position, row.block.size, e_block_size,
// inverse_ete_g.data(), sj.data());
double *val_ptr0 = (double *)(values + b_pos);
double *val_ptr1 = (double *)(values + b_pos + 1);
sj.data()[0] -= val_ptr0[0] * inverse_ete_eb;
sj.data()[1] -= val_ptr1[0] * inverse_ete_eb;
// clang-format on
int block_id = row.cells[1].block_id;
int block_size = bs->cols[block_id].size;
int block = block_id - num_eliminate_blocks_;
// clang-format off
double sj_0 = sj.data()[0];
double sj_1 = sj.data()[1];
// clang-format off
double *val_ptr = (double *)(values + row.cells[1].position);
double *sj_ptr = sj.data();
double *rhs_ptr = rhs + lhs_row_layout_[block];
__m256d v_sj_0 = _mm256_broadcast_sd(&sj_0);
__m256d v_sj_1 = _mm256_broadcast_sd(&sj_1);
__m256d val_00 = _mm256_loadu_pd(val_ptr);
__m256d val_01 = _mm256_loadu_pd(&val_ptr[block_size]);
__m256d rhs_val0 = _mm256_loadu_pd(rhs_ptr);
__m256d val_10 = _mm256_loadu_pd(&val_ptr[4]);
__m256d val_11 = _mm256_loadu_pd(&val_ptr[block_size + 4]);
__m256d rhs_val1 = _mm256_loadu_pd(&rhs_ptr[4]);
val_10 = _mm256_blendv_pd(val_10,v_zero,mask);
val_11 = _mm256_blendv_pd(val_11,v_zero,mask);
rhs_val0 = _mm256_fmadd_pd(val_00,v_sj_0,rhs_val0);
rhs_val0 = _mm256_fmadd_pd(val_01,v_sj_1,rhs_val0);
rhs_val1 = _mm256_fmadd_pd(val_10,v_sj_0,rhs_val1);
rhs_val1 = _mm256_fmadd_pd(val_11,v_sj_1,rhs_val1);
_mm256_storeu_pd(&rhs_ptr[0],rhs_val0);
rhs_ptr[4] = rhs_val1[0];
rhs_ptr[5] = rhs_val1[1];
block_id = row.cells[2].block_id;
block_size = bs->cols[block_id].size;
block = block_id - num_eliminate_blocks_;
// clang-format off
val_ptr = (double *)(values + row.cells[2].position);
sj_ptr = sj.data();
rhs_ptr = rhs + lhs_row_layout_[block];
val_00 = _mm256_loadu_pd(val_ptr);
val_01 = _mm256_loadu_pd(&val_ptr[block_size]);
rhs_val0 = _mm256_loadu_pd(rhs_ptr);
val_10 = _mm256_loadu_pd(&val_ptr[4]);
val_11 = _mm256_loadu_pd(&val_ptr[block_size + 4]);
rhs_val1 = _mm256_loadu_pd(&rhs_ptr[4]);
val_10 = _mm256_blendv_pd(val_10,v_zero,mask);
val_11 = _mm256_blendv_pd(val_11,v_zero,mask);
rhs_val0 = _mm256_fmadd_pd(val_00,v_sj_0,rhs_val0);
rhs_val0 = _mm256_fmadd_pd(val_01,v_sj_1,rhs_val0);
rhs_val1 = _mm256_fmadd_pd(val_10,v_sj_0,rhs_val1);
rhs_val1 = _mm256_fmadd_pd(val_11,v_sj_1,rhs_val1);
// clang-format on
_mm256_storeu_pd(&rhs_ptr[0],rhs_val0);
rhs_ptr[4] = rhs_val1[0];
rhs_ptr[5] = rhs_val1[1];
b_pos += 2;
}
}
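A note on the mask trick used above: `_mm256_blendv_pd` selects, per lane, the second source operand whenever the sign bit of the corresponding mask lane is set. `_mm256_set_pd(-1,-1,2,2)` lists lanes high-to-low, so the low two lanes carry 2.0 (sign clear, keep the value) and the high two carry -1.0 (sign set, take zero). That lets a 6-wide F row be processed as one full 4-lane vector plus a second 4-lane vector whose top two lanes are zeroed. A scalar emulation of the zeroing step:

```cpp
#include <cassert>
#include <cmath>

// Emulates _mm256_blendv_pd(v, zero, mask): any lane whose mask value
// has its sign bit set is replaced by 0.0; other lanes pass through.
void blendv_with_zero(double v[4], const double mask[4]) {
  for (int i = 0; i < 4; ++i)
    if (std::signbit(mask[i])) v[i] = 0.0;
}
```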
The next step needs the E'F matrix, so a separate loop is written to compute it:
for (int j = 0; j < chunk.size; ++j)
{
const CompressedRow& row = bs->rows[chunk.start + j];
const Cell& e_cell = row.cells.front();
// buffer = E'F. This computation is done by iterating over the
// f_blocks for each row in the chunk.
int f_block_id = row.cells[1].block_id;
int f_block_size = bs->cols[f_block_id].size;
double* buffer_ptr = buffer;
// clang-format off
double *val_ptr0 = (double *)(values + e_cell.position);
double *val_ptr1 = (double *)(values + row.cells[1].position);
__m256d v_val0 = _mm256_loadu_pd(val_ptr1);
__m256d v_val1 = _mm256_loadu_pd(&val_ptr1[4]);
__m256d v_sum0 = _mm256_loadu_pd(buffer_ptr);
__m256d v_sum1 = _mm256_loadu_pd(&buffer_ptr[4]);
__m256d v_val2 = _mm256_loadu_pd(&val_ptr1[6]);
__m256d v_val3 = _mm256_loadu_pd(&val_ptr1[10]);
__m256d v_cel_0 = _mm256_broadcast_sd(&val_ptr0[0]);
__m256d v_cel_1 = _mm256_broadcast_sd(&val_ptr0[1]);
v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
v_val3 = _mm256_blendv_pd(v_val3,v_zero,mask);
v_sum1 = _mm256_blendv_pd(v_sum1,v_zero,mask);
v_sum0 = _mm256_fmadd_pd(v_cel_0,v_val0,v_sum0);
v_sum0 = _mm256_fmadd_pd(v_cel_1,v_val2,v_sum0);
v_sum1 = _mm256_fmadd_pd(v_cel_0,v_val1,v_sum1);
v_sum1 = _mm256_fmadd_pd(v_cel_1,v_val3,v_sum1);
_mm256_storeu_pd(&buffer_ptr[0],v_sum0);
buffer_ptr[4] = v_sum1[0];
buffer_ptr[5] = v_sum1[1];
f_block_id = row.cells[2].block_id;
f_block_size = bs->cols[f_block_id].size;
buffer_ptr = buffer + FindOrDie(chunk.buffer_layout, f_block_id);
// clang-format off
val_ptr0 = (double *)(values + e_cell.position);
val_ptr1 = (double *)(values + row.cells[2].position);
v_val0 = _mm256_loadu_pd(val_ptr1);
v_val1 = _mm256_loadu_pd(&val_ptr1[4]);
v_sum0 = _mm256_loadu_pd(buffer_ptr);
v_sum1 = _mm256_loadu_pd(&buffer_ptr[4]);
v_val2 = _mm256_loadu_pd(&val_ptr1[6]);
v_val3 = _mm256_loadu_pd(&val_ptr1[10]);
v_cel_0 = _mm256_broadcast_sd(&val_ptr0[0]);
v_cel_1 = _mm256_broadcast_sd(&val_ptr0[1]);
v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
v_val3 = _mm256_blendv_pd(v_val3,v_zero,mask);
v_sum1 = _mm256_blendv_pd(v_sum1,v_zero,mask);
v_sum0 = _mm256_fmadd_pd(v_cel_0,v_val0,v_sum0);
v_sum0 = _mm256_fmadd_pd(v_cel_1,v_val2,v_sum0);
v_sum1 = _mm256_fmadd_pd(v_cel_0,v_val1,v_sum1);
v_sum1 = _mm256_fmadd_pd(v_cel_1,v_val3,v_sum1);
_mm256_storeu_pd(&buffer_ptr[0],v_sum0);
buffer_ptr[4] = v_sum1[0];
buffer_ptr[5] = v_sum1[1];
// clang-format on
}
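For the same shapes (2x1 e-block, 2x6 f-blocks), the buffer = E'F accumulation this loop performs per f-block reduces to a short scalar sketch (illustrative names, assuming the VINS block sizes):

```cpp
#include <cassert>
#include <cmath>

// Scalar sketch of buffer += E'F for one f-block: E is 2x1, F is 2x6,
// so the contribution to the chunk buffer is a 1x6 row vector.
void accumulate_etf(const double e_cell[2], const double f_cell[2][6],
                    double buffer[6]) {
  for (int c = 0; c < 6; ++c)
    buffer[c] += e_cell[0] * f_cell[0][c] + e_cell[1] * f_cell[1][c];
}
```

The vectorized version broadcasts the two e-block entries and sweeps the six columns as a 4-lane FMA plus a masked 2-lane tail.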
To reduce the amount of computation, F' is factored out, transforming the expression into:
F'(b - E(E'E)^(-1) E'b)
ChunkOuterProduct() computes:
S -= F'E(E'E)^{-1}E'F
Its internal code is:
// Compute the outer product F'E(E'E)^{-1}E'F and subtract it from the
// Schur complement matrix, i.e
//
// S -= F'E(E'E)^{-1}E'F.
template <int kRowBlockSize, int kEBlockSize, int kFBlockSize>
void SchurEliminator<kRowBlockSize, kEBlockSize, kFBlockSize>::
ChunkOuterProduct(int thread_id,
const CompressedRowBlockStructure* bs,
const Matrix& inverse_ete,
const double* buffer,
const BufferLayoutType& buffer_layout,
BlockRandomAccessMatrix* lhs) {
// This is the most computationally expensive part of this
// code. Profiling experiments reveal that the bottleneck is not the
// computation of the right-hand matrix product, but memory
// references to the left hand side.
const int e_block_size = inverse_ete.rows();
BufferLayoutType::const_iterator it1 = buffer_layout.begin();
double* b1_transpose_inverse_ete =
chunk_outer_product_buffer_.get() + thread_id * buffer_size_;
// S(i,j) -= bi' * ete^{-1} b_j
for (; it1 != buffer_layout.end(); ++it1) {
const int block1 = it1->first - num_eliminate_blocks_;
const int block1_size = bs->cols[it1->first].size;
// clang-format off
MatrixTransposeMatrixMultiply
<kEBlockSize, kFBlockSize, kEBlockSize, kEBlockSize, 0>(
buffer + it1->second, e_block_size, block1_size,
inverse_ete.data(), e_block_size, e_block_size,
b1_transpose_inverse_ete, 0, 0, block1_size, e_block_size);
// clang-format on
BufferLayoutType::const_iterator it2 = it1;
for (; it2 != buffer_layout.end(); ++it2) {
const int block2 = it2->first - num_eliminate_blocks_;
int r, c, row_stride, col_stride;
CellInfo* cell_info =
lhs->GetCell(block1, block2, &r, &c, &row_stride, &col_stride);
if (cell_info != NULL) {
const int block2_size = bs->cols[it2->first].size;
//LOG(INFO) << " " << block1_size << " " << block2_size;
std::lock_guard<std::mutex> l(cell_info->m);
// clang-format off
MatrixMatrixMultiply
<kFBlockSize, kEBlockSize, kEBlockSize, kFBlockSize, -1>(
b1_transpose_inverse_ete, block1_size, e_block_size,
buffer + it2->second, e_block_size, block2_size,
cell_info->values, r, c, row_stride, col_stride);
// clang-format on
}
}
}
}
Rewritten with AVX intrinsics, this part becomes:
for(int i = 0;i < chunks_.size();i++)
{
double* buffer = buffer_.get();
const Chunk& chunk = chunks_[i];
VectorRef(buffer, buffer_size_).setZero();
const CompressedRowBlockStructure* bs = A.block_structure();
const double* values = A.values();
typename EigenTypes<kEBlockSize, kEBlockSize>::Matrix inverse_ete(1,1);
inverse_ete.data()[0] = ete_inv[i];
__m256i vindex0 = _mm256_set_epi64x(24,16,8,0);
__m256i vindex1 = _mm256_set_epi64x(28,20,12,4);
double zero_t = 0.0;
__m256d v_zero = _mm256_broadcast_sd(&zero_t);
__m256d mask = _mm256_set_pd(-1,-1,2,2);
for (int j = 0; j < chunk.size; ++j)
{
const CompressedRow& row = bs->rows[chunk.start + j];
const Cell& e_cell = row.cells.front();
// buffer = E'F. This computation is done by iterating over the
// f_blocks for each row in the chunk.
int f_block_id = row.cells[1].block_id;
int f_block_size = bs->cols[f_block_id].size;
double* buffer_ptr = buffer;
// clang-format off
double *val_ptr0 = (double *)(values + e_cell.position);
double *val_ptr1 = (double *)(values + row.cells[1].position);
__m256d v_val0 = _mm256_loadu_pd(val_ptr1);
__m256d v_val1 = _mm256_loadu_pd(&val_ptr1[4]);
__m256d v_sum0 = _mm256_loadu_pd(buffer_ptr);
__m256d v_sum1 = _mm256_loadu_pd(&buffer_ptr[4]);
__m256d v_val2 = _mm256_loadu_pd(&val_ptr1[6]);
__m256d v_val3 = _mm256_loadu_pd(&val_ptr1[10]);
__m256d v_cel_0 = _mm256_broadcast_sd(&val_ptr0[0]);
__m256d v_cel_1 = _mm256_broadcast_sd(&val_ptr0[1]);
v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
v_val3 = _mm256_blendv_pd(v_val3,v_zero,mask);
v_sum1 = _mm256_blendv_pd(v_sum1,v_zero,mask);
v_sum0 = _mm256_fmadd_pd(v_cel_0,v_val0,v_sum0);
v_sum0 = _mm256_fmadd_pd(v_cel_1,v_val2,v_sum0);
v_sum1 = _mm256_fmadd_pd(v_cel_0,v_val1,v_sum1);
v_sum1 = _mm256_fmadd_pd(v_cel_1,v_val3,v_sum1);
_mm256_storeu_pd(&buffer_ptr[0],v_sum0);
buffer_ptr[4] = v_sum1[0];
buffer_ptr[5] = v_sum1[1];
f_block_id = row.cells[2].block_id;
f_block_size = bs->cols[f_block_id].size;
buffer_ptr = buffer + FindOrDie(chunk.buffer_layout, f_block_id);
// clang-format off
val_ptr0 = (double *)(values + e_cell.position);
val_ptr1 = (double *)(values + row.cells[2].position);
v_val0 = _mm256_loadu_pd(val_ptr1);
v_val1 = _mm256_loadu_pd(&val_ptr1[4]);
v_sum0 = _mm256_loadu_pd(buffer_ptr);
v_sum1 = _mm256_loadu_pd(&buffer_ptr[4]);
v_val2 = _mm256_loadu_pd(&val_ptr1[6]);
v_val3 = _mm256_loadu_pd(&val_ptr1[10]);
v_cel_0 = _mm256_broadcast_sd(&val_ptr0[0]);
v_cel_1 = _mm256_broadcast_sd(&val_ptr0[1]);
v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
v_val3 = _mm256_blendv_pd(v_val3,v_zero,mask);
v_sum1 = _mm256_blendv_pd(v_sum1,v_zero,mask);
v_sum0 = _mm256_fmadd_pd(v_cel_0,v_val0,v_sum0);
v_sum0 = _mm256_fmadd_pd(v_cel_1,v_val2,v_sum0);
v_sum1 = _mm256_fmadd_pd(v_cel_0,v_val1,v_sum1);
v_sum1 = _mm256_fmadd_pd(v_cel_1,v_val3,v_sum1);
_mm256_storeu_pd(&buffer_ptr[0],v_sum0);
buffer_ptr[4] = v_sum1[0];
buffer_ptr[5] = v_sum1[1];
// clang-format on
}
param_idx++;
// S -= F'E(E'E)^{-1}E'F
//ChunkOuterProduct(
//0, bs, inverse_ete, buffer, chunk.buffer_layout, lhs);
const int e_block_size = 1;
BufferLayoutType::const_iterator it1 = chunk.buffer_layout.begin();
double* b1_transpose_inverse_ete = chunk_outer_product_buffer_.get();
// S(i,j) -= bi' * ete^{-1} b_j
int r, c, row_stride, col_stride;
CellInfo* cell_info =
lhs->GetCell(0, 0, &r, &c, &row_stride, &col_stride);
double *lhs_val = cell_info->values;
v_zero = _mm256_broadcast_sd(&zero_t);
mask = _mm256_set_pd(-1,-1,2,2);
for (; it1 != chunk.buffer_layout.end(); ++it1)
{
const int block1 = it1->first - num_eliminate_blocks_;
double *buffer_ptr = buffer + it1->second;
// clang-format on
__m256d vinverse_ete = _mm256_broadcast_sd(inverse_ete.data());
__m256d v_val0 = _mm256_loadu_pd(buffer_ptr);
__m256d v_val1 = _mm256_loadu_pd(&buffer_ptr[4]);
v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
__m256d btete0 = _mm256_mul_pd(v_val0,vinverse_ete);
__m256d btete1 = _mm256_mul_pd(v_val1,vinverse_ete);
_mm256_storeu_pd(&b1_transpose_inverse_ete[0],btete0);
b1_transpose_inverse_ete[4] = btete1[0];
b1_transpose_inverse_ete[5] = btete1[1];
BufferLayoutType::const_iterator it2 = it1;
for (; it2 != chunk.buffer_layout.end(); ++it2)
{
const int block2 = it2->first - num_eliminate_blocks_;
//cell_info = lhs->GetCell(block1, block2, &r, &c, &row_stride, &col_stride);
r = block_layout_[block1];
c = block_layout_[block2];
buffer_ptr = buffer + it2->second;
double *lhs_val_ptr = &lhs_val[r * col_stride + c];
for(int j = 0;j < 6;j++)
{
__m256d vb1_inverse_ete = _mm256_broadcast_sd(&b1_transpose_inverse_ete[j]);
v_val0 = _mm256_loadu_pd(buffer_ptr);
v_val1 = _mm256_loadu_pd(&buffer_ptr[4]);
v_val1 = _mm256_blendv_pd(v_val1,v_zero,mask);
__m256d res0 = _mm256_loadu_pd(&lhs_val_ptr[j * col_stride + 0]);
__m256d res1 = _mm256_loadu_pd(&lhs_val_ptr[j * col_stride + 4]);
__m256d btete0 = _mm256_mul_pd(vb1_inverse_ete,v_val0);
__m256d btete1 = _mm256_mul_pd(vb1_inverse_ete,v_val1);
res0 = _mm256_sub_pd(res0,btete0);
res1 = _mm256_sub_pd(res1,btete1);
_mm256_storeu_pd(&lhs_val_ptr[j * col_stride + 0],res0);
lhs_val_ptr[j * col_stride + 4] = res1[0];
lhs_val_ptr[j * col_stride + 5] = res1[1];
}
}
}
}
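With e_block_size = 1, the update the intrinsics perform per pair of f-blocks is a rank-one 6x6 update of the Schur complement: given the buffer rows etf1 = E'F_1 and etf2 = E'F_2 (each 1x6), S(block1, block2) -= etf1' * (E'E)^{-1} * etf2. A scalar sketch with illustrative names:

```cpp
#include <cassert>
#include <cmath>

// Scalar sketch of ChunkOuterProduct for e_block_size = 1:
// S -= etf1' * inverse_ete * etf2, a rank-one 6x6 update.
void chunk_outer_product_scalar(const double etf1[6], const double etf2[6],
                                double inverse_ete, double S[6][6]) {
  for (int r = 0; r < 6; ++r)
    for (int c = 0; c < 6; ++c)
      S[r][c] -= etf1[r] * inverse_ete * etf2[c];
}
```

Because (E'E)^{-1} is a scalar here, the whole inner MatrixMatrixMultiply collapses into broadcast-multiply-subtract, which is what the vectorized loop over `j < 6` does one output row at a time.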
After the subtraction finishes, i.e., once the loop completes, the function NoEBlockRowsUpdate() is called to compute S += F'F; this step is comparatively cheap, so its optimization can be deferred for now.
The listings above are long, and to keep the flow clear they are not the complete code; if you need the full optimized code, you can contact me on WeChat at yhtao923 and we can discuss it.
Copyright notice: this article, "vin-slam中调用ceres库内部代码分析与性能优化", was contributed by its author, and the views expressed are the author's own. For reprints, please contact the author and cite the source: https://m.elefans.com/xitong/1728257187a1151157.html