[Gnuk-users] Gnuk on a faster MCU

Aurelien Jarno aurelien at aurel32.net
Mon Sep 11 20:57:32 UTC 2017


Hi,

On 2017-09-11 10:17, NIIBE Yutaka wrote:
> Hello,
> 
> Firstly, let me explain current status of Chopstx/NeuG/Gnuk.

Thanks, it's a lot clearer to me where things are going.

> Aurelien Jarno <aurelien at aurel32.net> wrote:
> > I therefore started to prototype things a bit, and I "ported" Gnuk on a
> > STM32L432 MCU. I say "ported" because I have done things quick and dirty
> > and the keys are not even stored in flash, but in RAM. This MCU has a
> > Cortex-M4 CPU running at 80MHz and tiny caches (1kB for instructions,
> > 256B for data). It's available in a QFN32 case, even smaller than the
> > STM32F103. It's also able to do crystal-less USB (I haven't tried yet).
> >
> > On such a CPU, Gnuk is able to do a RSA2048 decryption in 0.84s and
> > a RSA4096 decryption in 5.18s (vs 1.27s and 8.22s on FST-01). The gains
> > are mainly due to the instruction cache, as it hides the wait states of
> > the flash memory. The remaining gain comes from the single cycle
> > multiply-and-add instructions. I have been able to get these down to
> > respectively 0.65s and 3.87s by using the UMAAL DSP instruction in
> > MULADDC and mpi_montsqr.
> 
> Great!
> 
> QFN32, crystal-less USB and single cycle multiply-and-add sound great to
> me.  I'm afraid STM32L432 has more features.

Yes, it has a few more interesting features, like 3 capacitive sensing
channels (useful, for example, to add an authentication validation) and a
random number generator. I won't fully trust such a generator, but it
can provide additional entropy to the one provided by the ADC. In order
to port Gnuk quickly I actually replaced all the ADC/NeuG code by a call
to the RNG. Note that the STM32L432 only has a single ADC.

I guess such a small chip without crystal can be used to create
something like tomu.im, though probably slightly bigger as there is also
a regulator to add. It just gets more difficult to find a thin PCB
manufacturer, especially for prototypes or small series.

> > I am still pondering whether to try with an even faster MCU, like an STM32F4
> > at 168MHz even if it comes in a bigger LQFP64 case. I would consider
> > getting a < 2s signature / decryption for a RSA4096 something
> > acceptable.
> 
> I think that 2 seconds is acceptable.  When I started the development of
> OpenPGPcard alternative, it took like 5 seconds for RSA1024 with
> ATmega328 running 20MHz.  I didn't feel it's acceptable for my own use
> cases.  Then, for RSA2048, it was something like 2 seconds with
> STM32F103 with PolarSSL in 2010.  Thus, I started Gnuk.
>
> BTW, currently we are using p*q modulus.  It is known that multi prime
> modulus can speed up RSA computation (It is patented by US5848159, still
> effective).  There was a technique of p^k*q modulus, which was patented
> by US6396926.  I found that the latter patent was expired in 2010, due
> to failure to pay maintenance fee.  For me, the latter technique seems
> to be covered by more general multi prime modulus technique.  If not, I
> wonder if we can use that.

Thanks for the pointer, I'll try to have a look at that. That would also
benefit the FST-01 users.
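
For illustration only, here is a toy sketch in C of what a three-prime
modulus looks like with CRT decryption. The primes (11, 13, 17), the
exponent and all function names are made up for the example and are far
too small to be secure; nothing here is taken from Gnuk or PolarSSL. The
point is that each of the three exponentiations works with a modulus and
an exponent about a third of the size of n, instead of half as with p*q.

```c
#include <assert.h>
#include <stdint.h>

/* Toy parameters (assumptions for this sketch, NOT secure sizes). */
enum { P = 11, Q = 13, R = 17, N = P * Q * R };   /* N = 2431 */
enum { E = 7, D = 823 };            /* E*D = 1 mod (P-1)(Q-1)(R-1) = 1920 */
enum { DP = 3, DQ = 7, DR = 7 };    /* D reduced mod P-1, Q-1, R-1 */
enum { P_INV_Q = 6, PQ_INV_R = 5 }; /* P^-1 mod Q and (P*Q)^-1 mod R */

/* Square-and-multiply modular exponentiation (toy sizes only). */
uint32_t modpow(uint32_t base, uint32_t exp, uint32_t mod)
{
    uint64_t result = 1, b = base % mod;
    while (exp) {
        if (exp & 1)
            result = result * b % mod;
        b = b * b % mod;
        exp >>= 1;
    }
    return (uint32_t)result;
}

/* Decrypt with three small exponentiations, recombined with Garner's
 * formula for the Chinese Remainder Theorem. */
uint32_t decrypt_crt3(uint32_t c)
{
    uint32_t m1 = modpow(c, DP, P);
    uint32_t m2 = modpow(c, DQ, Q);
    uint32_t m3 = modpow(c, DR, R);

    uint32_t x = m1;                                  /* correct mod P */
    uint32_t h = (m2 + Q - x % Q) % Q * P_INV_Q % Q;
    x += P * h;                                       /* correct mod P*Q */
    h = (m3 + R - x % R) % R * PQ_INV_R % R;
    x += P * Q * h;                                   /* correct mod N */
    return x;
}

/* Check every message against the plain (non-CRT) private exponent. */
void rsa_crt3_selfcheck(void)
{
    for (uint32_t m = 0; m < N; m++) {
        uint32_t c = modpow(m, E, N);
        assert(modpow(c, D, N) == m);
        assert(decrypt_crt3(c) == m);
    }
}
```

Calling rsa_crt3_selfcheck() walks over all messages below N and checks
that the CRT path agrees with the direct m = c^D mod N computation.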

I have started to look at optimizing the existing low-level math
operations. The thing I have learned with the Cortex-M4 is that the
addition really comes for free with a multiplication, so algorithms
which try to trade multiplications for additions are not faster.
Adding two variables with a carry is actually faster using UMAAL with
the two multiplicands being zero than using ADDS and ADC.
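
To make the UMAAL remark concrete, here is a small portable C model of
the instruction and of the two uses mentioned above. The helper names
(umaal, muladd_row, add_via_umaal) are hypothetical and this is not the
MULADDC macro from the Gnuk tree; real code would use the instruction
itself through inline assembly. It is only a sketch of the arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of UMAAL: {hi:lo} = a * b + lo + hi.
 * The sum always fits in 64 bits, because
 * (2^32 - 1)^2 + 2 * (2^32 - 1) = 2^64 - 1. */
static inline void umaal(uint32_t *lo, uint32_t *hi, uint32_t a, uint32_t b)
{
    uint64_t r = (uint64_t)a * b + *lo + *hi;
    *lo = (uint32_t)r;
    *hi = (uint32_t)(r >> 32);
}

/* d[0..n-1] += s[0..n-1] * b; returns the carry word out of the top limb.
 * One UMAAL per limb does the multiply, the accumulate and the carry. */
uint32_t muladd_row(uint32_t *d, const uint32_t *s, size_t n, uint32_t b)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t lo = d[i];
        umaal(&lo, &carry, s[i], b);  /* carry:lo = s[i]*b + d[i] + carry */
        d[i] = lo;
    }
    return carry;
}

/* Add two words with a zero product: returns the low word of x + y and
 * stores the carry (0 or 1) in *carry_out. */
uint32_t add_via_umaal(uint32_t x, uint32_t y, uint32_t *carry_out)
{
    uint32_t lo = x, hi = y;
    umaal(&lo, &hi, 0, 0);
    *carry_out = hi;
    return lo;
}

/* Small self-check of the model. */
void umaal_selfcheck(void)
{
    uint32_t c;
    assert(add_via_umaal(0xFFFFFFFFu, 1u, &c) == 0u && c == 1u);

    uint32_t d[2] = { 1, 0 };
    const uint32_t s[2] = { 2, 3 };
    assert(muladd_row(d, s, 2, 4) == 0u);
    assert(d[0] == 9u && d[1] == 12u);
}
```

The inner loop never needs a separate carry-propagation step, which is
why the addition is effectively free once the multiplication is there.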

> > It seems the biggest portability issue concerns the flash.
> 
> Right.
> 
> > The current code assumes that the pages are small (1 or 2kB) and that
> > the writes are done 2 bytes by 2 bytes. These assumptions are used in
> > src/flash.c, but also define the format of the data in
> > src/openpgp-do.c.
> 
> I feel that src/openpgp-do.c requires major surgery.  It assumes "2
> bytes by 2 bytes", and data can be overwritten by more 0-bit data.

I fully agree. Also, I guess flash.c should provide a bit more
abstraction so that we can provide a different version of flash.c for a
different MCU without any change to openpgp-do.c.
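
As a rough idea of what such an abstraction could look like, here is a
sketch in C with a RAM-backed mock behind it. All names (struct
flash_ops, mock_flash, ...) and the page and write-unit sizes are
assumptions made up for the example; none of this exists in src/flash.c
today. The upper layer would only look at page_size and write_unit
instead of hard-coding "1 or 2kB" and "2 bytes by 2 bytes".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-MCU flash interface. */
struct flash_ops {
    size_t page_size;    /* erase granularity, in bytes */
    size_t write_unit;   /* smallest programmable unit, in bytes */
    int (*erase_page)(size_t page);
    int (*program)(size_t offset, const uint8_t *data, size_t len);
    const uint8_t *(*read)(size_t offset);
};

/* RAM-backed mock; the sizes below are assumptions for the sketch. */
#define MOCK_PAGE_SIZE  2048u
#define MOCK_WRITE_UNIT 8u
#define MOCK_PAGES      2u

static uint8_t mock_mem[MOCK_PAGES * MOCK_PAGE_SIZE];

static int mock_erase_page(size_t page)
{
    if (page >= MOCK_PAGES)
        return -1;
    memset(mock_mem + page * MOCK_PAGE_SIZE, 0xFF, MOCK_PAGE_SIZE);
    return 0;
}

static int mock_program(size_t offset, const uint8_t *data, size_t len)
{
    /* Reject anything not aligned to the write unit or out of range. */
    if (offset % MOCK_WRITE_UNIT || len % MOCK_WRITE_UNIT)
        return -1;
    if (offset > sizeof mock_mem || len > sizeof mock_mem - offset)
        return -1;
    for (size_t i = 0; i < len; i++)
        mock_mem[offset + i] &= data[i];  /* programming only clears bits */
    return 0;
}

static const uint8_t *mock_read(size_t offset)
{
    return mock_mem + offset;
}

const struct flash_ops mock_flash = {
    .page_size  = MOCK_PAGE_SIZE,
    .write_unit = MOCK_WRITE_UNIT,
    .erase_page = mock_erase_page,
    .program    = mock_program,
    .read       = mock_read,
};
```

A port for another MCU would then supply its own struct flash_ops, and
the mock makes it possible to exercise the data object code on a host.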

> > I wonder if one way to fix that would be to use a single data pool,
> > with the possibility to store longer objects like keys or certificates.
> > It would mean triggering the garbage collector each time sensitive
> > data like a private key has been removed or replaced. This is however a
> > significant change to the current code.
> 
> I think that since data for private key is better to be handled
> carefully, it is good we have different data storage and different
> access routines.

I understand your concern. That said, given the coarse granularity of the
sectors on the more advanced STM32 MCUs, there are not a lot of
alternatives. One can imagine keeping the private keys in RAM just for the
duration of the flash erase, but it means the data is lost in case of power
failure.

> For certificate data object, I want to kill the feature.

Ok.

Thanks for all your answers. I guess for now I'll continue working on
porting Gnuk to the STM32L432. I guess I'll try to clean up all my
patches and start to submit the chopstx related ones in the next
days or weeks.

Aurelien

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien at aurel32.net                 http://www.aurel32.net


