Project Proposal:
CUDA Factor
AJ Kaufmann & David Matlack
Summary
Efficiently factor large numbers using CUDA.
Background

There are many algorithms designed to quickly factor large composite numbers. Pollard's p-1 Algorithm takes advantage of Fermat's Little Theorem to find factors of a composite number n. The algorithm is not inherently parallel, and it only finds a prime factor p of n when p - 1 is a product of small primes, which becomes less likely as the size of n increases. A better, inherently parallel algorithm is Lenstra's Elliptic Curve algorithm. Its inherent parallelism makes it attractive, but it is quite complicated.
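To make the structure of Pollard's p-1 concrete, here is a minimal host-side sketch of our own (not from any paper), restricted to word-sized n; the helper names and the smoothness bound B are ours, and real inputs would need multiple-precision arithmetic throughout:

```cuda
// Minimal host-side sketch of Pollard's p-1 for word-sized n.
#include <cstdint>
#include <cstdio>

// Greatest common divisor (Euclid).
static uint64_t gcd(uint64_t a, uint64_t b) {
    while (b != 0) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

// (a * b) mod m without overflow, via a 128-bit intermediate
// (GCC/Clang extension, available to nvcc host code).
static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)(((__uint128_t)a * b) % m);
}

// a^e mod m by square-and-multiply.
static uint64_t powmod(uint64_t a, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    a %= m;
    while (e > 0) {
        if (e & 1) r = mulmod(r, a, m);
        a = mulmod(a, a, m);
        e >>= 1;
    }
    return r;
}

// Pollard p-1 with bound B: compute a = 2^(B!) mod n incrementally;
// gcd(a - 1, n) is a nontrivial factor when some prime factor p of n
// has p - 1 that is B-smooth.
static uint64_t pollard_p_minus_1(uint64_t n, uint64_t B) {
    uint64_t a = 2;
    for (uint64_t k = 2; k <= B; k++)
        a = powmod(a, k, n);            // a = 2^(k!) mod n so far
    uint64_t g = gcd(a > 0 ? a - 1 : n - 1, n);
    return (g > 1 && g < n) ? g : 0;    // 0 means "no factor found"
}

int main() {
    // 299 = 13 * 23; 13 - 1 = 12 = 2^2 * 3 is 3-smooth, so B = 5 suffices.
    printf("factor: %llu\n", (unsigned long long)pollard_p_minus_1(299, 5));
}
```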

One application of finding prime factors is breaking RSA. RSA is a popular asymmetric cryptographic algorithm, meaning it uses two different keys, one public and one private, to encrypt and decrypt messages. Thus, in order to break RSA, we must determine the private key from the public key. In RSA, the public key is a tuple (n, e), where n is a large number with two large prime factors, p and q. The private key is an exponent d satisfying ed ≡ 1 (mod (p-1)(q-1)). If we can compute p and q from n, then we can compute (p-1)(q-1) and hence d. Thus, being able to find the prime factors of large numbers is the key to breaking RSA. More information on the algorithm can be found on the Wikipedia page for RSA.
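As a toy illustration of why knowing p and q is enough (using small textbook numbers): compute phi = (p-1)(q-1), then d as the inverse of e modulo phi via the extended Euclidean algorithm:

```cuda
// Toy illustration: recovering the RSA private exponent d once
// p and q are known. Textbook-sized numbers only.
#include <cstdint>
#include <cstdio>

// Extended Euclid, reduced to what we need: the inverse of a mod m,
// assuming gcd(a, m) = 1.
static int64_t modinv(int64_t a, int64_t m) {
    int64_t old_r = a, r = m, old_x = 1, x = 0;
    while (r != 0) {
        int64_t q = old_r / r, t;
        t = r; r = old_r - q * r; old_r = t;
        t = x; x = old_x - q * x; old_x = t;
    }
    return ((old_x % m) + m) % m;       // normalize into [0, m)
}

int main() {
    int64_t p = 61, q = 53, e = 17;     // textbook-sized example
    int64_t n = p * q;                  // public modulus: 3233
    int64_t phi = (p - 1) * (q - 1);    // 3120
    int64_t d = modinv(e, phi);         // 2753: 17 * 2753 ≡ 1 (mod 3120)
    printf("n = %lld, d = %lld\n", (long long)n, (long long)d);
}
```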

Challenge

The challenge in implementing a prime factorization algorithm is twofold. First, we will be doing arithmetic on numbers much larger than can be held in typical x86 primitives. Many libraries exist for multiple-precision arithmetic on the CPU, but there is no complete library available for CUDA. There is a partially implemented port of the GNU Multiple Precision (GMP) library called CUMP. We may have to extend this library or write the operations we need from scratch.
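For a sense of what these from-scratch primitives look like, here is a minimal sketch of multiple-precision addition over arrays of 32-bit limbs; the representation and names below are our own illustration, not CUMP's API:

```cuda
// Sketch of one multiple-precision primitive: addition of two N-limb
// unsigned integers stored little-endian (limb 0 = least significant).
#include <cstdint>

#define NLIMBS 32   // e.g. 32 x 32-bit limbs = 1024-bit integers

// Returns the final carry (1 if the sum overflowed NLIMBS limbs).
__host__ __device__ uint32_t mp_add(uint32_t out[NLIMBS],
                                    const uint32_t a[NLIMBS],
                                    const uint32_t b[NLIMBS]) {
    uint32_t carry = 0;
    for (int i = 0; i < NLIMBS; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry;  // 64-bit intermediate
        out[i] = (uint32_t)s;                        // low 32 bits
        carry  = (uint32_t)(s >> 32);                // propagate carry
    }
    return carry;
}
```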

The other challenge is choosing a factorization algorithm and implementing it. All of the candidate algorithms are inherently complex. In addition, if we choose to implement Pollard's p-1 Algorithm, we will have to figure out how to parallelize it efficiently.
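One plausible (if naive) parallelization, sketched below as our own idea for word-sized n: each CUDA thread runs an independent attempt with its own base, and any thread that finds a nontrivial gcd reports it. In the real implementation these helpers would become multiple-precision.

```cuda
// Sketch: each thread runs Pollard's p-1 with base a = 2 + tid.
// Word-sized n only (n < 2^63 here, so modular sums cannot overflow).
#include <cstdint>

__device__ uint64_t gcd_dev(uint64_t a, uint64_t b) {
    while (b != 0) { uint64_t t = a % b; a = b; b = t; }
    return a;
}

// (a * b) mod m by shift-and-add; safe while m < 2^63.
__device__ uint64_t mulmod_dev(uint64_t a, uint64_t b, uint64_t m) {
    uint64_t r = 0;
    a %= m;
    while (b > 0) {
        if (b & 1) r = (r + a) % m;
        a = (a + a) % m;
        b >>= 1;
    }
    return r;
}

__device__ uint64_t powmod_dev(uint64_t a, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    a %= m;
    while (e > 0) {
        if (e & 1) r = mulmod_dev(r, a, m);
        a = mulmod_dev(a, a, m);
        e >>= 1;
    }
    return r;
}

// Any thread that finds a nontrivial gcd publishes it.
__global__ void p_minus_1_kernel(uint64_t n, uint64_t B,
                                 unsigned long long *factor) {
    uint64_t a = 2 + blockIdx.x * blockDim.x + threadIdx.x;  // per-thread base
    for (uint64_t k = 2; k <= B; k++)
        a = powmod_dev(a, k, n);              // a = base^(B!) mod n
    uint64_t g = gcd_dev(a > 0 ? a - 1 : n - 1, n);
    if (g > 1 && g < n)
        atomicExch(factor, (unsigned long long)g);
}
```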

Resources

We will be working from several papers that have demonstrated GPU-based speedups in factorization, including "ECM on Graphics Cards" and "Efficient Parallel RSA Decryption Algorithm for Many-Core GPUs with CUDA," which give detailed explanations of GPU-based implementations of various factorization algorithms. Our choice of algorithm will be affected by both our understanding of the material and the viability of a GPU-based implementation.

These papers do not detail the implementation of the large integer type that will be necessary, so we will also need to either find a suitable library or implement these features from scratch. One possible candidate for a CUDA-based large integer library is CUMP. This library is incomplete, however, and would need to be extended with several modular and Euclidean arithmetic functions. Another downside to using this library is that it does not appear to be actively maintained.
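As a sketch of the kind of modular function we would have to add, here is (a + b) mod m built from compare/add/subtract over the limb representation from the earlier addition sketch; again, the names are our own, not CUMP's API:

```cuda
#include <cstdint>

#define NLIMBS 32

__host__ __device__ uint32_t mp_add(uint32_t out[], const uint32_t a[],
                                    const uint32_t b[]); // earlier sketch

// Compare: returns <0, 0, >0 as a <, ==, > b.
__host__ __device__ int mp_cmp(const uint32_t a[NLIMBS],
                               const uint32_t b[NLIMBS]) {
    for (int i = NLIMBS - 1; i >= 0; i--)
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    return 0;
}

// out = a - b with borrow propagation; returns the final borrow.
__host__ __device__ uint32_t mp_sub(uint32_t out[NLIMBS],
                                    const uint32_t a[NLIMBS],
                                    const uint32_t b[NLIMBS]) {
    uint32_t borrow = 0;
    for (int i = 0; i < NLIMBS; i++) {
        uint64_t d = (uint64_t)a[i] - b[i] - borrow;
        out[i] = (uint32_t)d;
        borrow = (uint32_t)((d >> 32) & 1);  // 1 if the subtraction wrapped
    }
    return borrow;
}

// (a + b) mod m, assuming a, b < m: add, then subtract m at most once.
__host__ __device__ void mp_addmod(uint32_t out[NLIMBS],
                                   const uint32_t a[NLIMBS],
                                   const uint32_t b[NLIMBS],
                                   const uint32_t m[NLIMBS]) {
    uint32_t carry = mp_add(out, a, b);
    if (carry || mp_cmp(out, m) >= 0)
        mp_sub(out, out, m);
}
```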

In order to implement CUDA-based software, we will need machines with CUDA-enabled GPUs. For this purpose, we intend to use the GPU-equipped machines in the Gates-Hillman Center.

Depending on the progress made on this project, we may attempt to implement a distributed GPU algorithm, which would probably make use of the Open MPI library and several of the CUDA-enabled machines in the GHC.
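If we reach that point, the host-side shape would be roughly the following sketch of our own: each MPI rank searches a disjoint slice of the base space on its local GPU, then the ranks combine results. launch_search is a hypothetical wrapper around a factoring kernel like the one above, returning 0 when no factor is found.

```cuda
// Host-side sketch of the distributed stretch goal.
#include <mpi.h>
#include <cstdint>
#include <cstdio>

// Hypothetical wrapper: launches the factoring kernel over the given
// base range on the local GPU and returns a factor, or 0 if none found.
unsigned long long launch_search(uint64_t n, uint64_t B,
                                 uint64_t first_base, uint64_t num_bases);

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    uint64_t n = 299, B = 64;          // toy inputs
    uint64_t per_rank = 4096;          // bases searched per rank
    unsigned long long local =
        launch_search(n, B, 2 + (uint64_t)rank * per_rank, per_rank);

    // Any rank that found a factor reports it; MAX picks a nonzero one.
    unsigned long long found = 0;
    MPI_Allreduce(&local, &found, 1, MPI_UNSIGNED_LONG_LONG, MPI_MAX,
                  MPI_COMM_WORLD);
    if (rank == 0) {
        if (found) printf("factor: %llu\n", found);
        else       printf("no factor found\n");
    }

    MPI_Finalize();
    return 0;
}
```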

Goals/Deliverables
The goals are as follows:
  1. Implement the basic multiple-precision functions needed (e.g. addition, subtraction, multiplication, division, modulus); a multiplication sketch appears at the end of this section.
  2. Implement the chosen factoring algorithm.
Some stretch goals:
  1. Interface our factorization code with RSA.
  2. Extend CUMP or write a somewhat complete library for doing multiple precision arithmetic on CUDA.
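As referenced in goal 1, here is a schoolbook multiplication sketch over the same limb representation used in the earlier sketches; it is O(N^2), but a reasonable baseline before attempting anything like Karatsuba:

```cuda
// Schoolbook multiplication: out (2*NLIMBS limbs) = a * b.
#include <cstdint>

#define NLIMBS 32

__host__ __device__ void mp_mul(uint32_t out[2 * NLIMBS],
                                const uint32_t a[NLIMBS],
                                const uint32_t b[NLIMBS]) {
    for (int i = 0; i < 2 * NLIMBS; i++) out[i] = 0;
    for (int i = 0; i < NLIMBS; i++) {
        uint32_t carry = 0;
        for (int j = 0; j < NLIMBS; j++) {
            // 32x32 -> 64-bit partial product plus accumulator and carry;
            // the sum cannot overflow 64 bits.
            uint64_t t = (uint64_t)a[i] * b[j] + out[i + j] + carry;
            out[i + j] = (uint32_t)t;
            carry = (uint32_t)(t >> 32);
        }
        out[i + NLIMBS] = carry;   // top limb of this row
    }
}
```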
Platform

We chose CUDA for this task because most of the fastest factorization algorithms are parallelizable. We also enjoyed using CUDA and think it is an exciting platform to develop on, since many important algorithms (factorization included) have not yet been thoroughly explored on the GPU. We have also found several papers in which promising speedups have been realized.

Proposed Schedule

Week          What We Plan To Do
Apr 1-7       Decide on a project.
Apr 8-14      Create proposal; choose algorithm; choose/implement multiple-precision library.
Apr 15-21     Finalize multiple-precision implementation; begin GPU-based implementation.
Apr 22-28     Continue work on GPU implementation.
Apr 29-May 5  Continue work on GPU implementation.
May 6-11      Finish GPU-based implementation; stretch goals.