In this report a number of algorithms for optimal control of a double inverted pendulum on a cart (DIPC) are investigated and compared. Modeling is based on the Euler-Lagrange equations derived by specifying a Lagrangian, the difference between the kinetic and potential energy of the DIPC system. This results in a system of nonlinear differential equations consisting of three second-order equations. This system of equations is then transformed into the usual form of six first-order ordinary differential equations (ODE) for control design purposes. Control of a DIPC poses a certain challenge, since unlike a robot, the system is underactuated: one controlling force per three degrees of freedom (DOF). In this report, the problem of optimal control minimizing a quadratic cost functional is addressed. Several approaches are tested: linear quadratic regulator (LQR), state-dependent Riccati equation (SDRE), optimal neural network (NN) control, and combinations of the NN with the LQR and the SDRE. Simulations reveal superior performance of the SDRE over the LQR and improvements provided by the NN, which compensates for model inadequacies in the LQR. The limited capability of the NN to approximate functions over a wide range of arguments prevents it from significantly improving the SDRE performance, providing only marginal benefits at larger pendulum deflections.
range of initial DIPC states. As a universal function approximator, the NN is thus trained to implement a nonlinear optimal controller.

As the last step, two combinations of feedback NN control with the LQR and the SDRE will be designed. The approximation capabilities of a NN are limited by its size and by the fact that optimization in the space of its weights is nonconvex. Thus, only local minima are usually found, and the solution is at best suboptimal. The problem is more severe with wider ranges of NN inputs and outputs. To address this, a combination of the NN with a conventional suboptimal feedback control is designed to simplify the NN input-output mapping and therefore reduce training complexity. If the conventional feedback control were optimal, the optimal NN output would trivially be zero. Simple logic suggests that a suboptimal conventional control will simplify the NN mapping to some extent: instead of generating optimal controls, the NN is trained to produce corrections to the controls generated by the conventional suboptimal controller. For example, the LQR provides near-optimal control in the vicinity of the equilibrium, since the nonlinear and linearized DIPC dynamics are close there. Thus the NN will only have to correct the LQR output when the linearization accuracy diminishes. The next sections discuss the above concepts in detail and illustrate them with simulations.

3 Modeling

The DIPC system is graphically depicted in Fig. 1. To derive its equations of motion, one possible way is to use the Lagrange equations:

    d/dt (∂L/∂θ̇) − ∂L/∂θ = Q                                    (1)

where L = T − P is the Lagrangian and Q is the vector of generalized forces (or moments) acting in the directions of the generalized coordinates θ and not accounted for in the formulation of kinetic energy T and potential energy P. The kinetic and potential energies of the system are given by the sums of the energies of its individual components (a wheeled cart and two pendulums):

    T = T0 + T1 + T2
    P = P0 + P1 + P2

Figure 1: Double inverted pendulum on a cart

where

    T0 = (1/2) m0 θ̇0²

    T1 = (1/2) m1 [(θ̇0 + l1 θ̇1 cos θ1)² + (l1 θ̇1 sin θ1)²] + (1/2) I1 θ̇1²
       = (1/2) m1 θ̇0² + (1/2)(m1 l1² + I1) θ̇1² + m1 l1 θ̇0 θ̇1 cos θ1

    T2 = (1/2) m2 [(θ̇0 + L1 θ̇1 cos θ1 + l2 θ̇2 cos θ2)² + (L1 θ̇1 sin θ1 + l2 θ̇2 sin θ2)²] + (1/2) I2 θ̇2²
       = (1/2) m2 θ̇0² + (1/2) m2 L1² θ̇1² + (1/2)(m2 l2² + I2) θ̇2²
         + m2 L1 θ̇0 θ̇1 cos θ1 + m2 l2 θ̇0 θ̇2 cos θ2 + m2 L1 l2 θ̇1 θ̇2 cos(θ1 − θ2)

    P0 = 0
    P1 = m1 g l1 cos θ1
    P2 = m2 g (L1 cos θ1 + l2 cos θ2)

Thus the Lagrangian of the system is given by

    L = (1/2)(m0 + m1 + m2) θ̇0²
        + (1/2)(m1 l1² + m2 L1² + I1) θ̇1² + (1/2)(m2 l2² + I2) θ̇2²
        + (m1 l1 + m2 L1) cos(θ1) θ̇0 θ̇1 + m2 l2 cos(θ2) θ̇0 θ̇2
        + m2 L1 l2 cos(θ1 − θ2) θ̇1 θ̇2
        − (m1 l1 + m2 L1) g cos θ1 − m2 l2 g cos θ2

Differentiating the Lagrangian with respect to θ̇ and θ yields the Lagrange equations (1) as

    d/dt (∂L/∂θ̇0) − ∂L/∂θ0 = u
    d/dt (∂L/∂θ̇1) − ∂L/∂θ1 = 0
    d/dt (∂L/∂θ̇2) − ∂L/∂θ2 = 0

Or explicitly,

    u = (m0 + m1 + m2) θ̈0 + (m1 l1 + m2 L1) cos(θ1) θ̈1 + m2 l2 cos(θ2) θ̈2
        − (m1 l1 + m2 L1) sin(θ1) θ̇1² − m2 l2 sin(θ2) θ̇2²

    0 = (m1 l1 + m2 L1) cos(θ1) θ̈0 + (m1 l1² + m2 L1² + I1) θ̈1
        + m2 L1 l2 cos(θ1 − θ2) θ̈2 + m2 L1 l2 sin(θ1 − θ2) θ̇2²
        − (m1 l1 + m2 L1) g sin θ1

    0 = m2 l2 cos(θ2) θ̈0 + m2 L1 l2 cos(θ1 − θ2) θ̈1 + (m2 l2² + I2) θ̈2
        − m2 L1 l2 sin(θ1 − θ2) θ̇1² − m2 l2 g sin θ2
The Lagrange equations for the DIPC system can be written in the more compact matrix form:

    D(θ) θ̈ + C(θ, θ̇) θ̇ + G(θ) = H u                             (2)

where

    D(θ) = [ d1          d2 cos θ1         d3 cos θ2
             d2 cos θ1   d4                d5 cos(θ1 − θ2)
             d3 cos θ2   d5 cos(θ1 − θ2)   d6 ]                   (3)

with constant coefficients d1 = m0 + m1 + m2, d2 = m1 l1 + m2 L1, d3 = m2 l2, d4 = m1 l1² + m2 L1² + I1, d5 = m2 L1 l2 and d6 = m2 l2² + I2, read off by matching (2) with the explicit equations above; likewise,

    C(θ, θ̇) = [ 0   −d2 sin(θ1) θ̇1         −d3 sin(θ2) θ̇2
                0    0                      d5 sin(θ1 − θ2) θ̇2
                0   −d5 sin(θ1 − θ2) θ̇1     0 ]                   (4)

    G(θ) = [ 0;  −f1 sin θ1;  −f2 sin θ2 ],   H = [ 1; 0; 0 ]     (5)

where f1 = (m1 l1 + m2 L1) g and f2 = m2 l2 g. Introducing the state vector x = (θ0, θ1, θ2, θ̇0, θ̇1, θ̇2)ᵀ, the dynamics take the standard form of six first-order ODEs,

    ẋ = [ θ̇;  D(θ)⁻¹ (H u − C(θ, θ̇) θ̇ − G(θ)) ] ≡ fc(x, u)       (6)

The optimal control problem is to minimize the quadratic cost functional

    J = Σ_{k=t}^{tfinal} Lk(xk, uk)                               (7)

which represents an accumulated cost of the sequence of states xk and controls uk from the current discrete time t to the final time tfinal. For regulation problems tfinal = ∞. Optimization is done with respect to the control sequence subject to the constraints of the system dynamics (6). In our case,

    Lk(xk, uk) = xkᵀ Q xk + ukᵀ R uk                              (8)
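As an illustration, the model can be exercised directly in code. The following minimal sketch (Python/NumPy) assembles D, C, G and H and evaluates the right-hand side of (6). The values of l1, l2, I1 and I2 are assumptions (centers of mass at mid-length, uniform-rod inertias), since only the parameters of Table 1 are listed in this report.

import numpy as np

# Parameters from Table 1; l1, l2, I1, I2 are assumed, not given in the report.
m0, m1, m2 = 1.5, 0.5, 0.75          # cart and pendulum masses, kg
L1, L2 = 0.5, 0.75                   # full pendulum lengths, m
l1, l2 = L1 / 2, L2 / 2              # centers of mass (assumed at mid-length)
I1, I2 = m1 * L1**2 / 12, m2 * L2**2 / 12   # uniform-rod inertias (assumed)
g = 9.81

# Coefficients of (3)-(5), read off by matching (2) with the explicit equations.
d1 = m0 + m1 + m2
d2 = m1 * l1 + m2 * L1
d3 = m2 * l2
d4 = m1 * l1**2 + m2 * L1**2 + I1
d5 = m2 * L1 * l2
d6 = m2 * l2**2 + I2
f1 = (m1 * l1 + m2 * L1) * g
f2 = m2 * l2 * g

H = np.array([1.0, 0.0, 0.0])        # one control force for three DOF

def D(th):
    """Inertia matrix D(theta) of (3)."""
    c1, c2, c12 = np.cos(th[1]), np.cos(th[2]), np.cos(th[1] - th[2])
    return np.array([[d1, d2 * c1, d3 * c2],
                     [d2 * c1, d4, d5 * c12],
                     [d3 * c2, d5 * c12, d6]])

def C(th, thd):
    """Coriolis/centrifugal matrix C(theta, theta_dot) of (4)."""
    s1, s2, s12 = np.sin(th[1]), np.sin(th[2]), np.sin(th[1] - th[2])
    return np.array([[0.0, -d2 * s1 * thd[1], -d3 * s2 * thd[2]],
                     [0.0, 0.0, d5 * s12 * thd[2]],
                     [0.0, -d5 * s12 * thd[1], 0.0]])

def G(th):
    """Gravity vector G(theta) of (5)."""
    return np.array([0.0, -f1 * np.sin(th[1]), -f2 * np.sin(th[2])])

def f_c(x, u):
    """Continuous dynamics (6): x = (theta, theta_dot), xdot = f_c(x, u)."""
    th, thd = x[:3], x[3:]
    thdd = np.linalg.solve(D(th), H * u - C(th, thd) @ thd - G(th))
    return np.concatenate([thd, thdd])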
4.2 State-Dependent Riccati Equation Control

An approximate nonlinear analytical solution of the optimal control problem (7)–(8) subject to (6) is given by a technique referred to as state-dependent Riccati equation (SDRE) control. The SDRE approach [3] involves manipulating the dynamic equations

    ẋ = f(x, u)

into a pseudo-linear state-dependent coefficient (SDC) form in which the system matrices are explicit functions of the current state:

    ẋ = A(x) x + B(x) u                                          (15)

A standard LQR problem (Riccati equation) can then be solved at each time step to design the state feedback control law on-line. For digital implementation, (15) is approximately discretized at each time step into

    xk+1 = Φ(xk) xk + Γ(xk) uk                                   (16)

and the SDRE regulator is then specified similarly to the discrete LQR (compare with (13)) as

    uk = −R⁻¹ Γᵀ(xk) P(xk) xk ≡ −K(xk) xk                        (17)

where P(xk) is the steady-state solution of the difference Riccati equation, obtained by solving the discrete-time algebraic Riccati equation (14) using the state-dependent matrices Φ(xk) and Γ(xk), which are treated as constant at each time step. Thus the approach is sometimes considered a nonlinear extension of the LQR.

Proposition 1. The dynamic equations of a double inverted pendulum on a cart are presentable in the SDC form (15).

Proof. From the derived dynamic equations for the DIPC system (6) it is clear that the required SDC form (15) can be obtained if the vector G(θ) is presentable in the SDC form G(θ) = Gsd(θ) θ. Let us construct Gsd(θ) as

    Gsd(θ) = [ 0    0                  0
               0   −f1 sin(θ1)/θ1      0
               0    0                 −f2 sin(θ2)/θ2 ]

The elements of the constructed Gsd(θ) are bounded everywhere and G(θ) = Gsd(θ) θ as required. Thus the system dynamic equations can be presented in the SDC form as

    ẋ = [ 0          I      ] x + [ 0     ] u                    (18)
        [ −D⁻¹ Gsd  −D⁻¹ C  ]     [ D⁻¹ H ]

The derived system equations (18) (compare with the linearized system (9)–(11)) are discretized at each time step into (16), and the control is then computed as given by (17).

Remark. In the neighborhood of the equilibrium x = 0, the system equations in the SDC form (18) turn into the linearized equations (9) used in the LQR design. This can be checked either by direct computation or by noting that, on one hand, linearization yields

    ẋ = A x + B u + O(x)²

and, on the other hand, the SDC form of the system dynamics is

    ẋ = A(x) x + B(x) u

Since O(x)² → 0 when x → 0, then A(x) → A and B(x) → B. Therefore, it is natural to expect that the performance of the SDRE regulator will be very close to the LQR in the vicinity of the equilibrium, and the difference will show at larger pendulum deflections. This is exactly what the Simulation Results section illustrates.
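A minimal sketch of one step of the SDRE regulator follows, reusing D, C, H, f1 and f2 from the modeling sketch above. The report does not fix the discretization scheme for (16); a one-step Euler approximation Φ ≈ I + A∆t, Γ ≈ B∆t is assumed here, and the gain follows (17).

from scipy.linalg import solve_discrete_are
import numpy as np

dt = 0.02                                    # sampling interval, Table 1

def Gsd(th):
    """SDC gravity matrix of Proposition 1; sin(z)/z is guarded near z = 0."""
    s1 = np.sin(th[1]) / th[1] if abs(th[1]) > 1e-9 else 1.0
    s2 = np.sin(th[2]) / th[2] if abs(th[2]) > 1e-9 else 1.0
    return np.diag([0.0, -f1 * s1, -f2 * s2])

def sdre_control(x, Q, R):
    """One on-line step: build (18), discretize to (16), solve the DARE, apply (17)."""
    th, thd = x[:3], x[3:]
    Dinv = np.linalg.inv(D(th))
    A = np.block([[np.zeros((3, 3)), np.eye(3)],
                  [-Dinv @ Gsd(th), -Dinv @ C(th, thd)]])   # SDC form (18)
    B = np.concatenate([np.zeros(3), Dinv @ H]).reshape(6, 1)
    Phi, Gam = np.eye(6) + A * dt, B * dt    # Euler discretization into (16)
    P = solve_discrete_are(Phi, Gam, Q, np.atleast_2d(R))   # frozen-coefficient DARE
    K = (Gam.T @ P) / R                      # gain of (17): R^-1 Gamma' P
    return -(K @ x).item()

Q = np.diag([5.0, 50.0, 50.0, 20.0, 700.0, 700.0])          # Table 1
R = 1.0                                                     # Table 1
u = sdre_control(np.array([0.0, 0.3, 0.2, 0.0, 0.0, 0.0]), Q, R)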
4.3 Neural Network Learning Control

Neural network (NN) control is popular in the control of nonlinear systems due to the universal function approximation capabilities of NNs. Neural networks with only one hidden layer and an arbitrarily large number of neurons represent nonlinear mappings which can be used for approximation of any nonlinear function f ∈ C(Rⁿ, Rᵐ) over a compact subset of Rⁿ [7, 4, 12]. In this section, we utilize the function approximation properties of NNs to approximate a solution of the nonlinear optimal control problem (7)–(8) subject to the system dynamics (6), and thus design an optimal NN regulator. The problem can be solved by directly implementing a feedback controller (see Figure 2) as

    uk = NN(xk, w)

Figure 2: Neural network control diagram

Next, an optimal set of weights w is computed to solve the optimization problem. To do this, a standard calculus of variations approach is taken: minimization of a functional subject to the equality constraints xk+1 = f(xk, uk). Let λk be a vector of Lagrange multipliers in the augmented cost function

    H = Σ_{k=t}^{tfinal} { L(xk, uk) + λkᵀ (xk+1 − f(xk, uk)) }  (19)

We can now derive the recurrent Euler-Lagrange equations by solving ∂H/∂xk = 0 with respect to the Lagrange multipliers, and then find the optimal set of NN weights w⋆ by solving the optimality condition ∂H/∂w = 0 (numerically, by means of gradient descent). Dropping the dependence of L(xk, uk) and f(xk, uk) on xk and uk, and of NN(xk, w) on xk and w for brevity, the
Euler-Lagrange equations are derived as

    λk = (∂f/∂xk + ∂f/∂uk ∂NN/∂xk)ᵀ λk+1 + (∂L/∂xk + ∂L/∂uk ∂NN/∂xk)ᵀ   (20)

with λtfinal initialized as a zero vector. For L(xk, uk) given by (8), ∂L/∂xk = 2xkᵀQ and ∂L/∂uk = 2ukᵀR. These equations correspond to the adjoint system shown graphically in Figure 3, with the optimality condition

    ∂H/∂w = Σ_{k=t}^{tfinal} (λk+1ᵀ ∂f/∂uk + ∂L/∂uk) ∂NN/∂w = 0.        (21)

Figure 3: Adjoint system diagram

The overall training procedure for the NN can now be summarized as follows:

1. Simulate the system forward in time for tfinal time steps (Figure 2). Although, as mentioned, tfinal = ∞ for regulation problems, in practice it is set to a sufficiently large number (in our case tfinal = 500).

2. Simulate the adjoint system (Figure 3) backward in time according to (20), accumulating the weight gradient (21). This requires the system Jacobians ∂fc(x, u)/∂x and ∂fc(x, u)/∂u, where fc(x, u) is a brief notation for the right-hand side of the continuous system (6), i.e. ẋ = fc(x, u); the Jacobians are given below.

3. Update the NN weights by a gradient descent step on (21) and repeat.

The required system Jacobians are

    ∂fc(x, u)/∂x = [ 0        I       ]                                  (22)
                   [ −D⁻¹ M   −2D⁻¹ C ]

    ∂fc(x, u)/∂u = [ 0     ]                                             (23)
                   [ D⁻¹ H ]

where the matrix M(θ, θ̇) = (M0 M1 M2), and each of its columns is (note that ∂D⁻¹/∂θi = −D⁻¹ (∂D/∂θi) D⁻¹)

    Mi = (∂C/∂θi) θ̇ + ∂G/∂θi − (∂D/∂θi) D⁻¹ (C θ̇ + G − H u)             (24)

Remark 1. Clearly, the Jacobians (22)–(24) transform into the linear system matrices (10)–(11) at the equilibrium θ = 0, θ̇ = 0.

Remark 2. From (3)–(5) it follows that M0 ≡ 0.

Remark 3. The Jacobians (22)–(24) are derived from the continuous system equations (6). BPTT requires computation of the discrete system Jacobians. Thus, to use the derived matrices in NN training, they should be discretized (e.g. as was done for the LQR and SDRE).

Computation of the NN Jacobian is easy to perform given that the nonlinear functions of the individual neural elements are y = tanh(z). In this case, a NN with N0 inputs, a single output u and a single hidden layer with Nh elements is described by

    u = Σ_{i=1}^{Nh} w2i tanh( Σ_{j=1}^{N0} w1i,j xj + w1bi ) + w2b

or, in a more compact form,

    u = w2 tanh(W1 x + w1b) + w2b
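To make the training procedure concrete, here is a compact sketch of one BPTT iteration: a forward rollout under the NN controller, the adjoint recursion (20), and the gradient accumulation (21). For brevity, the discrete model is an Euler step of (6) (f_c from the modeling sketch) and the system Jacobians are taken by finite differences; the report instead uses the analytic Jacobians (22)–(24), discretized.

import numpy as np

rng = np.random.default_rng(0)
N0, Nh = 6, 40                               # NN sizes (Table 1)
W1, w1b = 0.1 * rng.standard_normal((Nh, N0)), np.zeros(Nh)
w2, w2b = 0.1 * rng.standard_normal(Nh), 0.0

def nn(x):
    """Forward pass: u = w2 tanh(W1 x + w1b) + w2b."""
    h = np.tanh(W1 @ x + w1b)
    return w2 @ h + w2b, h

def nn_jac_x(h):
    """dNN/dx = w2 diag(1 - h^2) W1, needed in (20)."""
    return (w2 * (1.0 - h**2)) @ W1

def nn_grad_w(x, h):
    """dNN/dw for every weight, needed in (21)."""
    gh = w2 * (1.0 - h**2)                   # backprop through tanh
    return {"W1": np.outer(gh, x), "w1b": gh, "w2": h, "w2b": 1.0}

def bptt_gradient(x0, T, dt, Q, R):
    """One forward pass (Figure 2) and one adjoint pass (Figure 3)."""
    f = lambda x, u: x + dt * f_c(x, u)      # Euler-discretized (6)
    xs, us, hs = [x0], [], []
    for _ in range(T):                       # step 1: simulate forward
        u, h = nn(xs[-1])
        us.append(u); hs.append(h)
        xs.append(f(xs[-1], u))
    grads = {"W1": np.zeros_like(W1), "w1b": np.zeros_like(w1b),
             "w2": np.zeros_like(w2), "w2b": 0.0}
    lam, eps = np.zeros(6), 1e-6             # lambda_{tfinal} = 0
    for k in reversed(range(T)):             # step 2: adjoint recursion (20)
        x, u, h = xs[k], us[k], hs[k]
        dfdx = np.stack([(f(x + eps * e, u) - f(x, u)) / eps
                         for e in np.eye(6)], axis=1)
        dfdu = (f(x, u + eps) - f(x, u)) / eps
        s = lam @ dfdu + 2.0 * R * u         # lam' df/du + dL/du of (21)
        for key, gw in nn_grad_w(x, h).items():
            grads[key] += s * gw             # accumulate dH/dw of (21)
        lam = dfdx.T @ lam + s * nn_jac_x(h) + 2.0 * Q @ x    # (20)
    return grads

Step 3 then moves every weight array a small step against its entry in grads, and the iteration repeats.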
Figure 4: Neural network structure: inputs x1 … x6 feed a hidden layer of Nh tanh elements (weights W1, biases w1b), followed by a linear output element (weights W2, bias w2b) producing u.

Figure 5: Neural network + LQR control diagram
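The composition used in Section 4.4 follows Figure 5; below is a minimal sketch, assuming the NN output is added as a correction to the conventional feedback −Kxk (the additive form matches the "corrections to the controls" discussion in Section 2; the exact composition is an assumption).

def combined_control(x, K):
    """Figure 5 (sketch): conventional feedback plus the NN correction."""
    u_nn, _ = nn(x)                  # correction term learned by the NN
    return u_nn - (K @ x).item()     # -Kx from the LQR, or K(x) from the SDRE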
where, in the case of the SDRE, K ≡ K(xk). Minimization of the augmented cost function (19) by taking derivatives of H with respect to xk and w yields slightly modified Euler-Lagrange equations:

    λk = (∂f/∂xk + ∂f/∂uk (∂NN/∂xk + K))ᵀ λk+1 + (∂L/∂xk + ∂L/∂uk (∂NN/∂xk + K))ᵀ   (28)

Note that when the SDRE is used instead of the LQR, ∂(K(xk)xk)/∂xk should be used in place of K in the above equation. These equations correspond to the adjoint system shown graphically in Figure 6 (compare to Figure 3), with the optimality condition (21).

Figure 6: Adjoint NN + LQR system diagram

The training procedure for the NN is the same as in the previous section, and all the system Jacobians and the NN Jacobian are computed in the same way. The only new item in this section is the SDRE Jacobian ∂(K(xk)xk)/∂xk, which appears in the Euler-Lagrange equations (28). This Jacobian is computed either numerically, which may be computationally expensive, or approximately as

    ∂(K(xk) xk)/∂xk ≈ K(xk)

Experiments in the Simulation Results section illustrate the superior performance of such combination control over pure NN or LQR control. For the SDRE, a noticeable cost reduction is achieved only near critical pendulum deflections (close to the boundaries of the SDRE recovery region).

Remark. The idea of using NNs to adjust the outputs of a conventional controller to account for differences between the actual system and the model used in the conventional control design was also employed in a number of works [15, 14, 2, 11]. Calise, Rysdyk and Johnson [2, 11] used a NN controller to supplement an approximate feedback linearization control of a fixed-wing aircraft and a helicopter. The NN provided additional controls to match the response of the vehicle to the reference model, compensating for the approximations in the autopilot design. Wan and Bogdanov [15, 14] designed a model predictive neural control (MPNC) for an autonomous helicopter, where a NN worked in pair with the SDRE autopilot and provided minimization of a quadratic cost function over a receding horizon. The SDRE control was designed for a simplified nonlinear model, and the NN made it possible to compensate for wind disturbances and higher-order terms unaccounted for in the SDRE design.

4.5 Receding Horizon Neural Network Control

Is it possible to further improve the control scheme designed in the previous section? Recall that the NN was trained numerous times over a wide range of pendulum angles to minimize the cost functional (19). The complexity of the approach is a function of the final time tfinal, which determines the length of a training epoch. The solution suggested in this section is to apply a receding horizon framework and train the NN in a limited range of pendulum angles, along a trajectory starting at the current point and having a relatively short duration (horizon) N. This is accomplished by rewriting the cost function (19) as

    H = Σ_{k=t}^{t+N−1} { L(xk, uk) + λkᵀ (xk+1 − f(xk, uk)) } + V(xt+N)

where the last term V(xt+N) denotes the cost-to-go from time t + N to time tfinal. Since the range of the NN inputs in this case is limited and the horizon length N is short, a shorter time is required to train the NN. However, only local minimization of the function (19) is achieved: starting from a significantly different initial condition, the NN will not provide cost minimization, since it was not trained for it. This issue is addressed by periodically retraining the NN: after a period of time called an update interval, the NN is retrained taking the current state vector as initial. The update interval is usually significantly shorter than the horizon, and in classical model-predictive control (MPC) it is often only one time step. This technique was applied to a helicopter control problem [15, 14] and is referred to as Model Predictive Neural Control (MPNC).

In practice, the true value of V(xt+N) is unknown and must be approximated. Most common is to simply set V(xt+N) = 0; however, this may lead to reduced stability and poor performance for short horizon lengths [9]. Alternatively, we may include a control Lyapunov function (CLF), which guarantees stability if the CLF is an upper bound on the cost-to-go, and results in a region of attraction for the MPC of at least that of the CLF [9]. The cost-to-go V(xt+N) can be approximated using the solution of the SDRE at time t + N,

    V(xt+N) ≈ xt+Nᵀ P(xt+N) xt+N

This CLF provides the exact cost-to-go for regulation assuming a linear system at the horizon time. A similar formulation was used for nonlinear regulation by Cloutier et al. [13].

All the equations from the previous section apply here as well. The Euler-Lagrange equations (20) are initialized in this case as

    λt+N = (∂V(xt+N)/∂xt+N)ᵀ ≈ P(xt+N) xt+N

where the dependence of the SDRE solution P on the state vector xt+N was neglected.
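A sketch of the receding-horizon cost with the SDRE-based CLF terminal term follows, reusing D, C, Gsd, H, f_c and nn from the sketches above (Euler discretization assumed, as before). The report minimizes this cost over the NN weights with the adjoint equations (20)–(21), initializing λt+N = P(xt+N)xt+N; only the cost evaluation is sketched here.

import numpy as np
from scipy.linalg import solve_discrete_are

def sdre_P(x, Q, R, dt=0.02):
    """Terminal CLF weight: the SDRE solution P(x) at the horizon state."""
    th, thd = x[:3], x[3:]
    Dinv = np.linalg.inv(D(th))
    A = np.block([[np.zeros((3, 3)), np.eye(3)],
                  [-Dinv @ Gsd(th), -Dinv @ C(th, thd)]])
    B = np.concatenate([np.zeros(3), Dinv @ H]).reshape(6, 1)
    return solve_discrete_are(np.eye(6) + A * dt, B * dt, Q, np.atleast_2d(R))

def horizon_cost(x0, N, dt, Q, R):
    """N-step running cost (8) plus the CLF term x' P x approximating V(x_{t+N})."""
    x, J = x0.copy(), 0.0
    for _ in range(N):
        u, _ = nn(x)                 # or the combined control of Section 4.4
        J += x @ Q @ x + R * u * u   # running cost (8)
        x = x + dt * f_c(x, u)       # Euler step of (6)
    return J + x @ sdre_P(x, Q, R, dt) @ x

After every update interval, the NN weights are re-optimized on this cost starting from the current state, which implements the periodic retraining described above.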
Stability of the MPNC is closely related to that of traditional MPC. Ideally, in the case of unconstrained optimization, stability is guaranteed provided V(xt+N) is a CLF and is an (incremental) upper bound on the cost-to-go [9]. In this case, the minimum region of attraction of the receding horizon optimal control is determined by the CLF used and the horizon length. The guaranteed region of operation contains that of the CLF controller and may be made as large as desired by increasing the optimization horizon (restricted to the infinite horizon domain) [10]. In our case, the minimum region of attraction of the receding horizon MPNC is determined by the SDRE solution used as the CLF to approximate the terminal cost. In addition, we also restrict the controls to be of the form given in Section 4.4, and the optimization is performed with respect to the NN weights w. In theory, the universal mapping capability of NNs implies that the stability guarantees are equivalent to those of the traditional MPC framework. However, in practice stability is affected by the chosen size of the NN (which affects its actual mapping capabilities), as well as by the horizon length N and the update interval length (how often the NN is re-optimized). When the horizon is short, performance is more affected by the chosen CLF.

Figure 7: Neural network adaptive control diagram
… but are left currently beyond our scope.

Table 1: Simulation parameters

    Parameter   Value
    m0          1.5 kg
    m1          0.5 kg
    m2          0.75 kg
    L1          0.5 m
    L2          0.75 m
    ∆t          0.02 s
    Q           diag(5 50 50 20 700 700)
    R           1
    Nh          40
    tfinal      500

Table 2: Control performance (cost) comparison; n/r — pendulums not recovered

    Control    Deflection of pendulums, deg.
               Opposite direction             Same direction
               10     15     18     28        20    30     35     67
    LQR        115.2  328.4  705.0  n/r       36.7  108.3  325.3  n/r
    SDRE       112.9  277.3  437.5  3655.8    36.0  84.0   118.0  5250.5
    NN         114.1  323.4  n/r    n/r       36.9  85.7   144.8  n/r
    NN+LQR     113.3  275.7  448.3  n/r       37.3  86.2   136.4  n/r
    NN+SDRE    112.5  276.3  436.6  2753.7    36.0  84.0   118.0  n/r

6 Conclusions

This report demonstrated a potential advantage of the SDRE technique over the LQR design in nonlinear optimal control problems with an underactuated plant. The region of pendulum recovery for the SDRE appeared to be 55 to 91 percent larger than in the case of the LQR control.

Direct optimization using neural networks yields results superior to the LQR, but the recovery region is about the same as in the LQR case. This happens due to the limited approximation capabilities of the NN and nonconvex numerical optimization challenges.

Combination of the NN control with the LQR (or with the SDRE) provides larger recovery regions and better overall performance. In this case the NN learns to generate corrections to the LQR (SDRE) control to compensate for the suboptimality of the LQR (SDRE).

To enhance this report, it would be valuable to investigate taking limited control authority into account in all the control designs.

References

[1] R. W. Brockett and H. Li. A light weight rotary double pendulum: maximizing the domain of attraction. In Proceedings of the 42nd IEEE Conference on Decision and Control, Maui, Hawaii, December 2003.
[9] A. Jadbabaie, J. Yu, and J. Hauser. Stabilizing receding horizon control of nonlinear systems: a control Lyapunov function approach. In Proceedings of the American Control Conference, 1999.

[10] A. Jadbabaie, J. Yu, and J. Hauser. Unconstrained receding horizon control of nonlinear systems. In Proceedings of the IEEE Conference on Decision and Control, 1999.

[11] E. Johnson, A. Calise, R. Rysdyk, and H. El-Shirbiny. Feedback linearization with neural network augmentation applied to X-33 attitude control. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, August 2000.

[12] W. T. Miller, R. S. Sutton, and P. J. Werbos. Neural Networks for Control. MIT Press, Cambridge, MA, 1990.

[13] M. Sznaier, J. Cloutier, R. Hull, D. Jacques, and C. Mracek. Receding horizon control Lyapunov function approach to suboptimal regulation of nonlinear systems. Journal of Guidance, Control and Dynamics, 23(3):399–405, May–June 2000.

[14] E. Wan, A. Bogdanov, R. Kieburtz, A. Baptista, M. Carlsson, Y. Zhang, and M. Zulauf. Model predictive neural control for aggressive helicopter maneuvers. In T. Samad and G. Balas, editors, Software Enabled Control: Information Technologies for Dynamical Systems, chapter 10, pages 175–200. IEEE Press, John Wiley & Sons, 2003.

[15] E. A. Wan and A. A. Bogdanov. Model predictive neural control with applications to a 6 DOF helicopter model. In Proceedings of the IEEE American Control Conference, Arlington, VA, June 2001.

[16] P. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, October 1990.

[17] W. Zhong and H. Rock. Energy and passivity based control of the double inverted pendulum on a cart. In Proceedings of the IEEE International Conference on Control Applications, Mexico City, Mexico, September 2001.
Figure 10: 10 deg. opposite direction: bottom and top pendulum angles and velocities, cart position and velocity, and control force vs. time for the SDRE, LQR, NN, NN+LQR and NN+SDRE controllers.

Figure 11: 20 deg. same direction: the same set of plots.
Figure 12: 15 deg. opposite direction: the same set of plots.

Figure 13: 30 deg. same direction: the same set of plots.
Figure 14: 18 deg. opposite direction: the same set of plots.

Figure 15: 35 deg. same direction: the same set of plots.
Figure 16: SDRE vs. NN+SDRE comparison: pendulum angles and velocities, cart position and velocity, and control force vs. time.

Figure 17: 67 deg. same direction, SDRE: pendulum angles and velocities, cart position and velocity, and control force vs. time.