
Qubit Matrix Chain Multiplication in O(n)

Benchmark

[Figure: Benchmark result of summation part]
[Figure: Benchmark result of chain multiplication part]
[Figure: Benchmark result of chain multiplication part (SymPauli only)]

SymPauli

https://github.com/hirasawakinko/SymPauli

Basics

Any matrix can be expressed as a weighted sum of Pauli operators.

The Pauli operators are a set of operators used in quantum computing. They are denoted as

I, X, Y, Z

In their matrix form

I := [ 1  0 ]
     [ 0  1 ]

X := [ 0  1 ]
     [ 1  0 ]

Y := [ 0 -i ]
     [ i  0 ]

Z := [ 1  0 ]
     [ 0 -1 ]

The "i" in Y is the "i" of imaginary number.

Say there is a matrix A; you can carry out a Pauli decomposition on it.

Pauli decomposition:

matrix A = summation(i)[ coefficient_i * Pauli_i ],  Pauli_i in {I, X, Y, Z}

for example 4I + 2Y + 7X - 4Z + 3Y ...
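As a concrete illustration, here is a minimal NumPy sketch (my own, not SymPauli code) that recovers the coefficients of a single-qubit matrix. It uses the standard fact that the Paulis are orthogonal under the trace inner product, so coefficient_P = trace(P * A) / 2:

import numpy as np

# Single-qubit Pauli matrices
I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
PAULIS = {"I": I, "X": X, "Y": Y, "Z": Z}

def pauli_decompose(A):
    # Each coefficient of a 2x2 matrix A is trace(P @ A) / 2.
    return {name: np.trace(P @ A) / 2 for name, P in PAULIS.items()}

A = 4 * I + 2 * Y + 7 * X - 4 * Z + 3 * Y   # the example above
print(pauli_decompose(A))                   # I: 4, X: 7, Y: 5, Z: -4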

Then you can carry out tensor products to make larger matrices.

I tensor X

= [ 1  0 ] tensor X
  [ 0  1 ]

= [ X  0 ]
  [ 0  X ]

= [ 0  1  0  0 ]
  [ 1  0  0  0 ]
  [ 0  0  0  1 ]
  [ 0  0  1  0 ]
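You can double-check this block pattern with NumPy's kron (a quick sketch of my own):

import numpy as np

I = np.eye(2)
X = np.array([[0, 1], [1, 0]])

# I tensor X: one copy of X per diagonal entry of I
print(np.kron(I, X))
# [[0. 1. 0. 0.]
#  [1. 0. 0. 0.]
#  [0. 0. 0. 1.]
#  [0. 0. 1. 0.]]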

And the rules of multiplication are:

I* == + *      (I times anything leaves it unchanged)
*I == + *
XX == YY == ZZ == I
XY ==  iZ
YX == -iZ
YZ ==  iX
ZY == -iX
ZX ==  iY
XZ == -iY

You may try multiplying these by hand to confirm that they are valid.

Since these symbolic rules hold, you can avoid matrix multiplication entirely and merely do string manipulation.

IX -> X
XY -> Z
YZ -> X
...

For example:

matrix A = 4I + 2Y + 7X - 4Z + 3Y

matrix A * Pauli X
= (4I + 2Y + 7X - 4Z + 3Y) * X
= (4X + 2Z + 7I - 4Y + 3Z)

I omitted all the coefficient changes (the ±i phases) here. Not because you don't have to track them; it is just that coefficients are not the main concern of this research.
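Here is a minimal Python sketch of the string-manipulation idea (my own illustration, not the SymPauli API), with the ±i phases tracked alongside so nothing is actually lost:

# Full single-qubit table: (left, right) -> (phase, result).
# Built from the rules above plus I*P = P*I = P and XX = YY = ZZ = I.
PAULI_MUL = {
    ("I", "I"): (1, "I"),   ("I", "X"): (1, "X"),   ("I", "Y"): (1, "Y"),   ("I", "Z"): (1, "Z"),
    ("X", "I"): (1, "X"),   ("X", "X"): (1, "I"),   ("X", "Y"): (1j, "Z"),  ("X", "Z"): (-1j, "Y"),
    ("Y", "I"): (1, "Y"),   ("Y", "X"): (-1j, "Z"), ("Y", "Y"): (1, "I"),   ("Y", "Z"): (1j, "X"),
    ("Z", "I"): (1, "Z"),   ("Z", "X"): (1j, "Y"),  ("Z", "Y"): (-1j, "X"), ("Z", "Z"): (1, "I"),
}

def mul(a, b):
    # Symbolic multiplication is a dictionary lookup: no matrices involved.
    return PAULI_MUL[(a, b)]

print(mul("X", "Y"))   # (1j, 'Z')   i.e. XY = iZ
print(mul("Y", "X"))   # (-1j, 'Z')  i.e. YX = -iZ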

Current status

There are many Python libraries that can do such symbolic quantum computation; however, they are not fast enough.

The algorithms they use are naive, causing massive redundant runtime. OpenFermion is one example.

Some are not even using symbolic computation. They just expand the string input into a matrix and then carry out matrix-vector multiplication; Qiskit and Qulacs are examples. This makes them absurdly slow at some of the simplest tasks.

For example, take this:

(*) represents the tensor product

matrix A
= I (*) I (*) I (*) ... (*) I     (120 identity operators)

Now we multiply it by

matrix B
= X (*) X (*) X (*) ... (*) X     (120 X operators)

How long do you think this would take? I can do it in under a second by hand, but them? Probably more than a year to output the answer.

For tensor products, you multiply position by position (qubit-wise). Since I*X -> X at every position, the result is just matrix B itself. Cool?
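Reusing the PAULI_MUL table from the earlier sketch, that 120-qubit product is just 120 independent table lookups, which is why it finishes instantly:

def mul_strings(a, b):
    # Multiply two Pauli strings position by position, accumulating the phase.
    phase, out = 1, []
    for pa, pb in zip(a, b):
        p, r = PAULI_MUL[(pa, pb)]
        phase *= p
        out.append(r)
    return phase, "".join(out)

A = "I" * 120
B = "X" * 120
print(mul_strings(A, B) == (1, B))   # True, well under a second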

Naive coding

So, what about OpenFermion's naive coding? Well, they almost got it, but not quite. Say you are multiplying two matrices A and B:

matrix A  |  matrix B
= I Y     |  = X Z
+ Y X     |  + Y Y
+ X Y     |  + X Y
+ Z X     |
+ Y Z     |

Just as you would expand (x + 3)(x + 2), you do the same thing with these matrices.

first, label them...

matrix A  |  matrix B
= I Y     |  = X Z  (b1)
+ Y X     |  + Y Y  (b2)
+ X Y     |  + X Y  (b3)
+ Z X     | 
+ Y Z     |


matrix A * matrix B

= matrix A * b1 + matrix A * b2 + matrix A * b3 ...

So you would carry out the broadcast multiplication like this:

  X Z (b1)    Y Y  (b2)    X Y  (b3)  
matrix A    matrix A     matrix A     
= I Y       = I Y        = I Y        
+ Y X    +  + Y X    +   + Y X      
+ X Y       + X Y        + X Y        
+ Z X       + Z X        + Z X        
+ Y Z       + Y Z        + Y Z


                 Y Y  (b2)    X Y  (b3)   
               matrix A     matrix A      
               = I Y        = I Y         
= something +  + Y X    +   + Y X   
               + X Y        + X Y         
               + Z X        + Z X         
               + Y Z        + Y Z


                 X Y  (b3)  
               matrix A     
               = I Y        
= something +  + Y X    
               + X Y      
               + Z X        
               + Y Z

At each step, you broadcast one row of matrix B against all rows of matrix A.

matrix A
= I Y * X Z (b1)
+ Y X * X Z (b1)
+ X Y * X Z (b1)
+ Z X * X Z (b1)
+ Y Z * X Z (b1)

This approach seems legit, but we can absolutely do better and faster.
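For reference, here is a sketch of that naive expansion (again my own illustration, not OpenFermion's actual code): every row of B is broadcast against every row of A, so the cost is |A| * |B| string multiplications, with duplicated work whenever left-hand operators repeat:

def naive_product(A_rows, B_rows):
    # Expand (sum of rows) * (sum of rows), exactly like (x + 3)(x + 2).
    result = []
    for b in B_rows:                           # one row of B at a time...
        for a in A_rows:                       # ...broadcast against every row of A
            result.append(mul_strings(a, b))   # qubit-wise multiply from above
    return result

A_rows = ["IY", "YX", "XY", "ZX", "YZ"]
B_rows = ["XZ", "YY", "XY"]
print(len(naive_product(A_rows, B_rows)))      # 15 terms: 5 rows * 3 rows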

What I have done in SymPauli

As you can see in the previous section, we have to reuse matrix A multiple times to carry out the complete matrix multiplication. From this, we know that matrix A basically never changes.

The not-so-obvious observation is that the Pauli operators are closed under multiplication (they form a group, up to phase). This means we can predict the outcome completely.

Once you merge like terms, e.g. 2Y + 3Y + 7X = 5Y + 7X, only unique Pauli tensors are left in the pool.

Say you have I X Y Z, still ignoring coefficients.

  (I X Y Z) * X
= (X I Z Y)

(I + X + Y + Z) is the simplest form of itself; you cannot simplify it any further. After multiplying it by something, the outcome is still in its simplest form. And you should notice that multiplication here literally means permutation.

  (I X Y Z) * Y
= (Y Z I X)

  (I X Y Z) * Z
= (Z Y X I)

Focus back on matrix A * b1:

  X Z (b1)
matrix A
= I Y
+ Y X
+ X Y
+ Z X
+ Y Z

Focus further on the left-hand Pauli operators:

  X (b1)
matrix A
= I
+ Y
+ X
+ Z
+ Y

By the previous rule, we know the answer already.

  (I X Y Z) * X
= (X I Z Y)

  X (b1)
matrix A
= I -> X
+ Y -> Z
+ X -> I
+ Z -> Y
+ Y -> Z

Here you see some duplication: Y appears twice, so Y*X -> Z gets computed twice even though we already know the answer. Checking it row by row is just a waste of time.

Here is where the relational data structure comes in.

Relational data structure and super parallelism

This is one part of matrix A (a single column of Pauli operators), and we are going to apply one unified permutation to all of them.

matrix A
= I * X
+ Y * X
+ X * X
+ Z * X
+ Y * X

Here I propose a relational data structure to enable a super-parallel algorithm.

The data structure has a frontend and a backend. You map every value of the frontend to the backend, so that each frontend value is a reference to an entry at the back.

front and back

frontend   backend
   I       0 = I
   Y       1 = X
   X       2 = Y
   Z       3 = Z
   Y

mapping front to back
frontend   backend
    0      0 = I
    2      1 = X
    1      2 = Y
    3      3 = Z
    2

When you do a multiplication, you do it on the backend, not the frontend. After that, you read the frontend values out through the backend.

multiplying X
frontend   backend
    0      0 = I
    2      1 = X     * X
    1      2 = Y
    3      3 = Z
    2

multiplied X
frontend   backend
    0      0 = X
    2      1 = I
    1      2 = Z
    3      3 = Y
    2

reading result
frontend   backend
X < 0       0 = X
Z < 2       1 = I
I < 1       2 = Z
Y < 3       3 = Y
Z < 2

After reading out, you can reset matrix A to its initial state solely by resetting the backend.

reset backend
frontend   backend
I < 0       0 = I
Y < 2       1 = X
X < 1       2 = Y
Z < 3       3 = Z
Y < 2

So it is back to the original matrix A, and you can move on to the next multiplication (which also requires the original), namely:

  Y Y  (b2)
matrix A
= I Y
+ Y X
+ X Y
+ Z X
+ Y Z
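Putting the mechanism together, here is a minimal single-qubit sketch of the frontend/backend idea (my own illustration; SymPauli's actual implementation may differ). The frontend stores one fixed index per row; only the 4-entry backend is ever permuted, and reads and resets go through the mapping:

CODE = {"I": 0, "X": 1, "Y": 2, "Z": 3}
LABEL = "IXYZ"

# How multiplying by X / Y / Z permutes the canonical order (I, X, Y, Z),
# phases ignored as in the text: e.g. "X" means I->X, X->I, Y->Z, Z->Y.
PERMUTE = {"X": [1, 0, 3, 2], "Y": [2, 3, 0, 1], "Z": [3, 2, 1, 0]}

class RelationalColumn:
    def __init__(self, paulis):
        self.frontend = [CODE[p] for p in paulis]   # one fixed index per row
        self.backend = [0, 1, 2, 3]                 # always just 4 entries

    def mul(self, p):
        # Multiply every row at once in O(4), regardless of the row count.
        if p != "I":
            self.backend = [PERMUTE[p][v] for v in self.backend]

    def read(self):
        # Read each row out through the unchanged frontend mapping.
        return [LABEL[self.backend[i]] for i in self.frontend]

    def reset(self):
        self.backend = [0, 1, 2, 3]

col = RelationalColumn("IYXZY")   # the left column of matrix A
col.mul("X")                      # the b1 step from the text
print(col.read())                 # ['X', 'Z', 'I', 'Y', 'Z']
col.reset()                       # back to the original matrix A
print(col.read())                 # ['I', 'Y', 'X', 'Z', 'Y']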

Chain multiplication

In the last step of the previous section we reset the backend. However, that is not required if you are going to do a chain multiplication, i.e. A * b * c * d * e * ..., where {b, c, d, e} are written in lowercase because each one is a single row (not a linear combination).

frontend   backend
   I       0 = I
   Y       1 = X   * b * c * d * e * ...
   X       2 = Y
   Z       3 = Z
   Y

By not resetting the backend after every multiplication, each step uses the previous state as the starting point of the next one. The initial frontend mapping stays untouched, so you can read the final result with that same initial mapping.

initial mapping
frontend   backend
    0      0 = I
    2      1 = X  * b * c * d * e * ...
    1      2 = Y
    3      3 = Z
    2

after all multiplications
frontend   backend
    0      0 = ?
    2      1 = ?
    1      2 = ?
    3      3 = ?
    2

you can read the result with the very same initial mapping
frontend   backend
? < 0      0 = ?
? < 2      1 = ?
? < 1      2 = ?
? < 3      3 = ?
? < 2
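With the RelationalColumn sketch from the previous section, a chain is nothing but repeated backend permutations and a single read at the very end:

col = RelationalColumn("IYXZY")   # left column of matrix A, as before
for p in "XYZYX":                 # the chain A * x * y * z * y * x
    col.mul(p)                    # each step permutes only the 4-entry backend
print(col.read())                 # ['Z', 'X', 'Y', 'I', 'X'] -- one read at the end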

Scalability

This relational data structure and algorithm are parallel-native. Even without parallel programming, you still gain a massive speedup.

All data manipulation is done solely on the backend, and the backend is tiny: the backend array holds 4 Pauli operators * N qubits entries, so one multiplication costs O(n) in the number of qubits. At this size you can easily fit the whole thing into CPU cache and run at unprecedented speed.

This is the initial state of a 6-qubit backend, where one row of your matrix is (I I I I I I):

qubits  0  1  2  3  4  5

0 = I   0  0  0  0  0  0
1 = X   1  1  1  1  1  1
2 = Y   2  2  2  2  2  2
3 = Z   3  3  3  3  3  3

my row : (I I I I I I) -> (0 0 0 0 0 0)

For example, after multiplying by (X Y Z Y Z X):

qubits  0  1  2  3  4  5

0 = I   1  2  3  2  3  1
1 = X   0  3  2  3  2  0
2 = Y   3  0  1  0  1  3
3 = Z   2  1  0  1  0  2

You just look up the same position in the table to find out what each entry has become.

my row : (I I I I I I) -> (1 2 3 2 3 1) -> (X Y Z Y Z X)
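Here is the same lookup in NumPy form (my own sketch): one 4-entry permutation per qubit column, applied once to the whole backend table:

import numpy as np

CODE = {"I": 0, "X": 1, "Y": 2, "Z": 3}
LABEL = "IXYZ"
PERMUTE = np.array([[0, 1, 2, 3],    # multiply by I
                    [1, 0, 3, 2],    # multiply by X
                    [2, 3, 0, 1],    # multiply by Y
                    [3, 2, 1, 0]])   # multiply by Z

n = 6
backend = np.tile(np.arange(4), (n, 1)).T    # 4 x n table, one column per qubit

# Multiply the whole backend by (X Y Z Y Z X): one small lookup per column.
for q, p in enumerate("XYZYZX"):
    backend[:, q] = PERMUTE[CODE[p]][backend[:, q]]

row = "IIIIII"                               # my row: (I I I I I I)
print("".join(LABEL[backend[CODE[p], q]] for q, p in enumerate(row)))   # XYZYZX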

The same table applies to all rows: multiply once, read many. Because Pauli multiplication acts as a permutation, it is just like rotating an analogue clock (the kind with hour and minute hands): the relative positions of 10 o'clock, 4 o'clock and everything else never change.

Second, everything is a summation, and the ordering of a summation does not matter because addition is commutative. So you can split the whole thing into subprocesses and sum the partial results later.

Further, the tensor product of Pauli operators is carried out position by position, so you can split the work into even smaller subtasks.

Also, the small size of the backend keeps the communication cost very low.

But it comes with a great penalty

The read and write performance of this data structure is fatal, because you first have to map the frontend information onto the backend, and afterwards you have to do the reverse. This translation cost is very expensive: for a matrix of 30000 Pauli tensor rows * 40 qubits, it is about 0.1X to 0.01X of the conventional R/W performance.

However, in the chain multiplication scenario, R/W is only performed at the very beginning and the very end. Multiplying a chain of 30000 single-row Pauli tensors * 40 qubits into a matrix of the same size (but as a linear combination), this data structure delivers 30000X to 40000X performance.

Until now I have been ignoring the coefficients and focusing on the Pauli operators. Every comparison stated above is:

conventional, with coefficients
VS
new, without coefficients

So yeah, it is a BS benchmark and nothing you should take too seriously. But that is because there is little room for improvement once coefficients enter the calculation: at most a 10X speedup at the same scale using Python, and I am already using NumPy for the heavy lifting on the coefficient work.

So you can see: no coefficients, 40000X. With coefficients, 10X. Damn, coefficients suck.

Conclusion

I independently invented a new algorithm using a relational data structure that beats the conventional approach by 40000X at the same scale, but only when I ignore all of the computation work for the coefficients. With coefficients, 10X is the maximum I can get. So, long live symbolic computation, and damn you, numeric computation.


Last update: 2023-06-28
Created: 2023-06-28