C. Proposed conjugate gradient projection method
We propose a CGP algorithm to optimize the covariance matrices for SZF. CGP methods are particularly useful in MIMO systems, as the solutions can be found using gradients of functions of complex-valued matrix variables. Some other optimization methods are only well defined for functions of real-valued vectors, in which case the covariance matrices and functions would first have to be decoupled and expressed in terms of such vectors. CGP and gradient projection algorithms have been used for covariance optimization in other similar circumstances: for example, CGP is used for weighted MAC sum-rate maximization in [25] and [26], and a gradient projection method is used for MIMO interference systems in [27] and for the MIMO MAC in [28]. We model our CGP algorithm after the one in [26], which operates on the transmit filter matrices $\mathbf{T}_u$ instead of on the covariance matrices $\mathbf{Q}_u$ directly. This method has the advantage of guaranteeing a positive semidefinite covariance matrix (the factorization is akin to a Cholesky decomposition [29]). Operating on $\mathbf{Q}_u$ directly would instead require a projection during each iteration to ensure the solution remains in the set of positive semidefinite matrices (cf. [25]).
Let us rewrite and combine (10) and (11) to account for a weighted sum rate. Without loss of generality, we assume $\pi(k) = k$ for brevity of notation:

(15)

Note in the above that $\operatorname{tr}(\mathbf{B}_k) = \operatorname{tr}(\mathbf{Q}_k)$: as the columns of the null space basis $\mathbf{V}_k$ are orthonormal, $\mathbf{V}_k^H \mathbf{V}_k = \mathbf{I}$, so $\operatorname{tr}(\mathbf{V}_k \mathbf{B}_k \mathbf{V}_k^H) = \operatorname{tr}(\mathbf{B}_k)$. Thus, the same transmit power constraint applies to $\mathbf{B}_k$ as to $\mathbf{Q}_k$.
Let us further define $\mathbf{B}_k = \mathbf{T}_k \mathbf{T}_k^H$, where $\mathbf{T}_k$ is a (generally non-square) matrix. Thus, $\mathbf{B}_k$ is positive semidefinite by construction. Defining $\mathbf{T}_k$ in such a manner helps reduce the complexity of the optimization by reducing the number of optimization variables [5]. The power constraint can also be re-expressed in terms of $\sum_k \|\mathbf{T}_k\|_F^2$, since $\operatorname{tr}(\mathbf{B}_k) = \operatorname{tr}(\mathbf{T}_k \mathbf{T}_k^H) = \|\mathbf{T}_k\|_F^2$.
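These trace identities are easy to confirm numerically. The following is a quick sketch with arbitrary randomly generated matrices; the symbols V, T, B here are illustrative stand-ins, not the paper's actual channel quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
# V with orthonormal columns (V^H V = I), e.g., from a QR decomposition
V, _ = np.linalg.qr(rng.standard_normal((6, 3)) + 1j * rng.standard_normal((6, 3)))
T = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
B = T @ T.conj().T

# tr(V B V^H) = tr(B), and tr(B) = tr(T T^H) = ||T||_F^2
print(np.isclose(np.trace(V @ B @ V.conj().T).real, np.trace(B).real))   # True
print(np.isclose(np.trace(B).real, np.linalg.norm(T, 'fro') ** 2))       # True
```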
The CGP algorithm that operates on $\mathbf{T}_k$ is described as Algorithm 1 in the following.
Algorithm 1 CGP Algorithm for SZF covariance optimization

Initialize: $\mathbf{T}_k$; $\mathbf{S}_k = \mathbf{0}, \forall k$; $\rho = 1$; $\alpha = 1$.
Calculate WSR from (15).
repeat
    Store $\mathbf{T}_{k,\mathrm{old}} = \mathbf{T}_k, \forall k$; $\mathbf{S}_{k,\mathrm{old}} = \mathbf{S}_k, \forall k$; $\rho_{\mathrm{old}} = \rho$; $\mathrm{WSR}_{\mathrm{old}} = \mathrm{WSR}$.
    Calculate gradients $\mathbf{G}_k, \forall k$, from (16).
    Normalize gradients.
    Project gradients.
    Calculate the Frobenius norm $\rho$.
    Determine search directions $\mathbf{S}_k$.
    Step in search directions.
    Normalize transmit filter sum-power.
    Calculate WSR from (15).
    Set LoopCounter = 0.
    while WSR < $\mathrm{WSR}_{\mathrm{old}}$ do
        Decrease step size $\alpha$.
        Set LoopCounter = LoopCounter + 1.
        if LoopCounter = LoopThresh then
            Set $\mathrm{WSR}_{\mathrm{old}}$ = WSR.
            Reset $\alpha$ to 1.
        end if
        Recalculate the step, $\mathbf{T}_k$, and WSR.
    end while
until desired accuracy reached
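To make the flow of Algorithm 1 concrete, the following is a minimal Python/NumPy sketch of a CGP loop with this structure. It is not the paper's algorithm verbatim: the objective is a simplified, interference-free weighted sum rate standing in for (15) and (16), and the gradient projection shown (onto the tangent plane of the sum-power sphere) is an illustrative choice rather than necessarily the projection of [26]. The loop-counter safeguard is omitted here and sketched separately further below.

```python
import numpy as np

rng = np.random.default_rng(0)
K0, M_T, N, P = 3, 4, 2, 10.0   # users, transmit antennas, receive antennas, sum power
w = np.ones(K0)                 # rate weights w_k
H = [rng.standard_normal((N, M_T)) + 1j * rng.standard_normal((N, M_T))
     for _ in range(K0)]        # per-user channels

def wsr(T):
    # simplified weighted sum rate standing in for (15): no inter-user
    # interference terms, unlike the actual SZF objective
    total = 0.0
    for k in range(K0):
        Mk = np.eye(N) + H[k] @ T[k] @ T[k].conj().T @ H[k].conj().T
        total += w[k] * np.log2(np.linalg.det(Mk).real)
    return total

def gradients(T):
    # Wirtinger gradient of each rate term w.r.t. conj(T_k), standing in for
    # (16); leading constants are dropped, since the gradients get normalized
    G = []
    for k in range(K0):
        Mk = np.eye(N) + H[k] @ T[k] @ T[k].conj().T @ H[k].conj().T
        G.append(w[k] * H[k].conj().T @ np.linalg.inv(Mk) @ H[k] @ T[k])
    return G

def normalize_power(T):
    # rescale so the transmit filter sum-power meets the constraint exactly
    scale = np.sqrt(P / sum(np.linalg.norm(Tk, 'fro') ** 2 for Tk in T))
    return [scale * Tk for Tk in T]

T = normalize_power([rng.standard_normal((M_T, N)) + 1j * rng.standard_normal((M_T, N))
                     for _ in range(K0)])
S = [np.zeros((M_T, N), complex) for _ in range(K0)]
rho, alpha, cur = 1.0, 1.0, wsr(T)

for it in range(1000):
    T_old, rho_old, wsr_old = T, rho, cur
    G = gradients(T)
    gnorm = np.sqrt(sum(np.linalg.norm(Gk, 'fro') ** 2 for Gk in G))
    G = [Gk / gnorm for Gk in G]                              # normalize gradients
    c = sum(np.trace(Gk.conj().T @ Tk).real for Gk, Tk in zip(G, T)) / P
    G = [Gk - c * Tk for Gk, Tk in zip(G, T)]                 # project gradients
    rho = sum(np.linalg.norm(Gk, 'fro') ** 2 for Gk in G)     # squared Frobenius norm
    S = [Gk + (rho / rho_old) * Sk for Gk, Sk in zip(G, S)]   # conjugate search directions
    T = normalize_power([Tk + alpha * Sk for Tk, Sk in zip(T_old, S)])
    cur = wsr(T)
    while cur < wsr_old and alpha > 1e-12:                    # backtrack the step size
        alpha *= 0.8                                          # (loop-counter reset omitted)
        T = normalize_power([Tk + alpha * Sk for Tk, Sk in zip(T_old, S)])
        cur = wsr(T)
    if cur - wsr_old < 1e-6:                                  # desired accuracy reached
        break

print(f"WSR ≈ {cur:.4f} bits/s/Hz after {it + 1} iterations")
```

Here the ratio $\rho/\rho_{\mathrm{old}}$ plays the role of the conjugacy factor, and the sum-power normalization acts as the projection back onto the feasible set.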
Because the SZF WSR maximization problem is not convex, the CGP algorithm may not necessarily find the global optimum. Furthermore, the optimal $\mathbf{T}_k$ is not necessarily unique, since $\mathbf{B}_k$ is positive semidefinite. (For example, post-multiply $\mathbf{T}_k$ by any unitary matrix, and the new $\mathbf{T}_k$ will yield the same $\mathbf{B}_k$, and thus the same WSR.) The local optimum that the algorithm finds also depends to some degree on the initial values of $\mathbf{T}_k$. Often, when optimizing covariance matrices, an initial choice of a scaled identity matrix is used, but in general this cannot be done here, as $\mathbf{T}_k$ is generally not a square matrix. Furthermore, even if the algorithm were operating on $\mathbf{B}_k$ instead of $\mathbf{T}_k$, a scaled identity would still not be an appropriate starting point, as the rank would likely be too large; the rank of $\mathbf{B}_k$ would be its full dimension rather than at most the column dimension of $\mathbf{T}_k$. Instead, we initialize $\mathbf{T}_k$ by distributing equal values among the columns of $\mathbf{T}_k$ in a round-robin fashion. This is equivalent to creating an identity matrix whose size equals the number of columns of $\mathbf{T}_k$, vertically concatenating copies of its rows until the required number of rows is reached, and finally scaling all entries by the desired initial value. For example, if $\mathbf{T}_k$ were 3 × 2, entries (1,1), (2,2), and (3,1) of $\mathbf{T}_k$ would be initialized to that value, while the remaining entries would be 0.
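A minimal sketch of this initialization (the scaling constant `v` is a stand-in; the specific initial value used is not reproduced here):

```python
import numpy as np

def init_T(n_rows: int, n_cols: int, v: float) -> np.ndarray:
    """Round-robin initialization: tile the rows of an n_cols x n_cols
    identity matrix until there are n_rows rows, then scale by v."""
    eye = np.eye(n_cols)
    reps = -(-n_rows // n_cols)                 # ceil(n_rows / n_cols)
    return v * np.tile(eye, (reps, 1))[:n_rows, :]

# 3 x 2 example from the text: entries (1,1), (2,2), (3,1) get v, the rest 0
print(init_T(3, 2, 1.0))
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]
```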
The gradient can be calculated using matrix calculus, from the partial differential of (15) with respect to the conjugate of $\mathbf{T}_k$ [30]. Since the gradients will be normalized, leading constants can be left off. It can be shown that the gradient for user $k$ is proportional to

(16)
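As an illustration of this kind of complex matrix gradient (a hedged sketch: the objective below is a single generic log-det term with an arbitrary matrix $\mathbf{A}$, not the actual SZF expression in (15)), the Wirtinger derivative of $f(\mathbf{T}) = \ln\det(\mathbf{I} + \mathbf{A}\mathbf{T}\mathbf{T}^H\mathbf{A}^H)$ with respect to $\mathbf{T}^*$ is $\mathbf{A}^H(\mathbf{I} + \mathbf{A}\mathbf{T}\mathbf{T}^H\mathbf{A}^H)^{-1}\mathbf{A}\mathbf{T}$, which can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
T = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))

def f(T):
    M = np.eye(2) + A @ T @ T.conj().T @ A.conj().T
    return np.log(np.linalg.det(M)).real

# analytic gradient: df/dT* = A^H (I + A T T^H A^H)^{-1} A T
G = A.conj().T @ np.linalg.inv(np.eye(2) + A @ T @ T.conj().T @ A.conj().T) @ A @ T

# finite-difference check: for a small perturbation dT, df ≈ 2 Re tr(G^H dT)
dT = 1e-6 * (rng.standard_normal(T.shape) + 1j * rng.standard_normal(T.shape))
lhs = f(T + dT) - f(T)
rhs = 2 * np.trace(G.conj().T @ dT).real
print(abs(lhs - rhs) / abs(lhs))   # ~1e-6 or smaller: directional derivatives match
```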
The gradients in (16) appear quite complex at first glance. However, some computational savings can be obtained by calculating parts of the gradients successively. To begin, the inner sums appearing in the gradients can first be calculated and stored for each $i = 1, \ldots, K_0$, to avoid calculating these sums multiple times. Next, we can define $\mathbf{Z}_k$ as

(17)

Then, each gradient $\mathbf{G}_k$ can be calculated starting from $k = K_0$ and working downwards, using a running sum for $\mathbf{Z}_k$: each $\mathbf{Z}_k$ is obtained from $\mathbf{Z}_{k+1}$ by adding a single new term. For example, if $K_0 = 4$, $\mathbf{Z}_3$ follows from $\mathbf{Z}_4$ with one additional term, and so on.
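In code, this running-sum evaluation has the following shape (a hedged sketch: since the body of (17) is specific to the SZF gradients, `term` and `grad_from_Z` below are hypothetical placeholders; only the downward order of evaluation is the point):

```python
import numpy as np

K0, M = 4, 3
rng = np.random.default_rng(2)

def term(k):
    """Hypothetical stand-in for the per-user summand accumulated by (17)."""
    X = rng.standard_normal((M, M))
    return X @ X.T

def grad_from_Z(k, Z):
    """Hypothetical stand-in for evaluating (16) for user k given Z_k."""
    return Z  # placeholder

Z = np.zeros((M, M))             # running sum, assumed to start from zero
G = {}
for k in range(K0, 0, -1):       # k = K0 downwards, as in the text
    Z = Z + term(k)              # each Z_k reuses Z_{k+1}; one new term per step
    G[k] = grad_from_Z(k, Z)
```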
In [26], the authors define aggregate matrices $\mathbf{G}$, $\mathbf{S}$, and $\mathbf{T}$, which are the horizontal concatenations of the matrices $\mathbf{G}_u$, $\mathbf{S}_u$, and $\mathbf{T}_u$, respectively. This primarily allows them to avoid sums of squared Frobenius norms and traces in the notation of their algorithm. For example, in the gradient normalization step, the per-user sum $\sum_u \|\mathbf{G}_u\|_F^2$ can be represented more compactly as $\|\mathbf{G}\|_F^2$. Strictly speaking, this notation is not possible with our adaptation for SZF, as the gradients $\mathbf{G}_k$ and matrices $\mathbf{T}_k$ are generally of different dimensions for each $k$. An equivalent notation could still be used by instead defining the aggregate matrices as block-diagonal arrangements of the component matrices rather than horizontal concatenations. However, this could require additional memory and computational complexity unless the algorithm accounts for the sparseness of the aggregate matrices (i.e., the many entries that equal zero after block-diagonalization), and it is not strictly necessary in the first place.
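A brief sketch of this block-diagonal equivalence, using per-user matrices of deliberately different (arbitrary) dimensions:

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(3)
G = [rng.standard_normal((3, 2)),       # per-user "gradients" of differing sizes
     rng.standard_normal((4, 1)),
     rng.standard_normal((2, 2))]

per_user = sum(np.linalg.norm(Gk, 'fro') ** 2 for Gk in G)
aggregate = np.linalg.norm(block_diag(*G), 'fro') ** 2   # same value, larger sparse matrix
print(np.isclose(per_user, aggregate))                   # True
```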
In the "step in search directions" portion of the algorithm, it is possible to find an approximately best step size, for example via an inexact line search like Armijo's Rule [
31]. However, we find just as in [
26] that it is generally sufficient to simply reduce the step size by a factor if there is no increase in the WSR. For example, we had good results when using equal weights of
w
k
= 1, ∀
k, by simply multiplying
α by about 0.8. We did, however, notice on rare occasions when the algorithm did not converge properly
a. This is likely due to the non-convexity of the problem; the algorithm is likely stalling near a saddle point in these cases. Repeated decreases in
α did not result in an increase in the WSR, and often led to a small decrease in the WSR. This may also be due to the fact that when a non-linear function is being optimized, an inexact line search (or lack of one, in our case) can lead to the search not being in the correct direction [
31]. For example, if a function is being maximized, although the search should be in a direction of ascent, the search direction may actually be in one of descent. Thus, we implement the addition of a loop counter to compensate for these rare cases. If the loop counter reaches a certain threshold (we use a threshold of 100), the previous best WSR is set to the currently found value for the WSR, and
α is reset
b to 1. Since this updated value is often smaller than the previous value, there is a guaranteed larger value that the algorithm can head towards. This slight decrease in WSR and resetting of
α is generally enough for the algorithm to get sufficiently far enough away from wherever it has stalled to continue finding a better solution (i.e., even better than where it stalled). If the algorithm's WSR still does not increase notably at this point, then it means the algorithm has found a local solution to the problem, as the change in WSR should be less than the desired accuracy. Thus, the algorithm can stop and return the current solution.
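In code, the safeguarded inner loop looks roughly as follows (continuing the style of the earlier sketch; `step` is a hypothetical helper that recomputes the stepped, power-normalized $\mathbf{T}_k$ and the resulting WSR for a given $\alpha$):

```python
LOOP_THRESH = 100   # our threshold; see the discussion of 0.8**100 below

def backtrack(T_old, S, alpha, wsr_old, step):
    """Inner loop of Algorithm 1: shrink alpha while the WSR decreases.
    If LOOP_THRESH is hit, accept the slightly lower WSR as the new
    reference and reset alpha to 1 to escape the stall."""
    T, cur = step(T_old, S, alpha)
    loops = 0
    while cur < wsr_old:
        alpha *= 0.8                  # decrease step size
        loops += 1
        if loops == LOOP_THRESH:      # likely stalled near a saddle point
            wsr_old = cur             # previous best WSR := current WSR
            alpha, loops = 1.0, 0     # reset alpha
        T, cur = step(T_old, S, alpha)
    return T, cur, alpha
```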
While we found that decreasing the step size $\alpha$ by a factor of 0.8 worked well in our simulations, this value can likely be tuned, depending on the specific system parameters and/or the channels experienced by the users, to improve the convergence of the algorithm. The loop threshold, however, is best made dependent on the step size factor and the numerical precision of the system, and optionally also on the desired accuracy of the WSR. For instance, with our values of 0.8 and 100, note that $0.8^{100} \approx 2 \times 10^{-10} \approx 2^{-32}$. Thus, when the loop threshold is reached, the gradient steps would be changing around the 10th decimal place, or the 32nd bit of a floating-point representation. Further decreases in the step size would result in changes in the gradients (and thus the WSR) that are quite insignificant and likely below whatever solution accuracy would be required in practice. Our loop threshold of 100 is therefore reasonably logical in conjunction with the step size factor of 0.8.
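The arithmetic behind this choice is easily checked:

```python
print(0.8 ** 100)   # ≈ 2.037e-10
print(2.0 ** -32)   # ≈ 2.328e-10, so 0.8**100 ≈ 2e-10 ≈ 2**-32
```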
We lastly wish to note that the proposed algorithm, like the existing method from [5], is meant to find covariance matrices for a given selection of users and their order. The problem of user scheduling and the selection of an order is a complicated issue in and of itself, and for the most part outside the scope of this paper. Where scheduling is involved herein, we generally consider an optimal exhaustive search. The goal of this paper is rather to provide an improved algorithm that applies to whatever selection and order the scheduler may wish to examine. The capability of examining alternative choices might be desired by the scheduler to, for example, meet certain QoS guarantees for the users, such as a minimum throughput.
D. Discussion of complexity
In [8, 9], we calculate the complexity of the method for calculating covariance matrices from [5] in terms of the number of flops (floating-point operations) required. It was found that, assuming all users have the same number of receive antennas $N$, the existing method has the same complexity order as that of finding the null space basis vectors for SZF. Since our CGP algorithm also requires these vectors, it too must have at least this complexity order. This is in fact exactly the complexity order of the CGP algorithm, as the other steps are of no greater complexity. $M_T$ is the largest matrix dimension encountered, but it is never necessary to multiply two $M_T \times M_T$ matrices together (with complexity $\mathcal{O}(M_T^3)$ [32]) for all $K_0$ users. Each gradient requires a matrix inversion, but this is of an $N_k \times N_k$ matrix, which has complexity order $\mathcal{O}(N_k^3)$ [32]. The only other term of comparable order is a per-user matrix multiplication whose cost, summed over all $K_0$ users, also works out to the same overall order.
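As a rough order-of-magnitude illustration (constant factors and exact flop counts ignored; the dimensions are arbitrary), the per-user small-matrix inversions are negligible next to even a single set of full-size matrix products:

```python
M_T, N, K0 = 64, 2, 16       # arbitrary example dimensions
mult_flops = K0 * M_T ** 3   # K0 hypothetical products of two M_T x M_T matrices
inv_flops = K0 * N ** 3      # K0 inversions of N_k x N_k matrices (N_k = N here)
print(f"{mult_flops:.1e} vs {inv_flops:.1e}")   # ~4.2e6 vs ~1.3e2
```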
In fact, any precoding method that requires calculating an SVD, a QR decomposition, or a pseudoinverse of an $M \times N$ or $N \times M$ matrix for each of $K_0$ users (for example, to find beamforming vectors for the precoder) will have a comparable complexity order, where $M$ is the larger matrix dimension [32]. As a comparison, again assuming all users have $N$ receive antennas, the regularized BD method of [15] performs an SVD of a $(K_0 - 1)N \times M_T$ matrix for each user. As a system will generally be looking to schedule as many users and/or send as many data streams as possible, the product $K_0 N$ is of the same order as $M_T$, so the method of [15] is also of approximately the same complexity order. The efficient WSR method of [16] is technically also a scheduling algorithm, so there would be an additional factor of $K$ in its complexity. However, to compare it on equal terms, we ignore the complexity of determining the best user to allocate a data stream to, and simply assume that decision has been made. At each step $i$ of the method, when allocating the $i$th stream, the method calculates the pseudoinverse of an $i \times M_T$ composite channel matrix to perform waterfilling. In the case where the base station transmits the maximum of $M_T$ data streams, the total complexity order would theoretically follow from summing this pseudoinverse cost over all $M_T$ steps. However, some complexity savings might be possible if the pseudoinverse can be recursively calculated as each row is added to the composite channel matrix. We are uncertain what the exact complexity order would be in such a case, however.
In fairness, we note that while the precoding methods compared above scale with roughly the same complexity order as our proposed CGP algorithm, their power allocations are in closed form, in contrast to the iterative power allocation of our algorithm. Thus, the highest-order term carries a smaller constant for those algorithms than for the CGP algorithm.