[lacnog] problems with LACNIC's RPKI TA

Mon Apr 17 12:49:59 -03 2023

Hi Carlos, LACNOG,

On Sun, Apr 16, 2023 at 11:00:31AM -0300, Carlos Marcelo Martinez Cagnazzo wrote:
> There was indeed a problem, which lasted approximately one hour, well
> within the tolerance of the objects, so anyone who was already
> running validators should not have had difficulty using their
> local cache.

From my observations the outage lasted approximately two hours. And
unfortunately, for many - probably most - RPKI validator instances
around the world, their 'caché local' became entirely invalid, because
of internally inconsistent RRDP delta updates.

Many RPs saw a Manifest reference to a file to which they had no access.
The manifest in question is a 'top-level' manifest and basically is a
'gateway' to all the CA certificates representing every RPKI-enabled
LACNIC member. To see a recent copy in decoded form:
http://console.rpki-client.org/repository.lacnic.net/rpki/lacnic/48f083bb-f603-4893-9990-0284c04ceb85/ff14e9055d5afaa37fbe20f4a26bd13c8f18d79a.mft.html
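
To illustrate the kind of check an RP performs here, below is a minimal
Python sketch. It is not how any particular validator implements it. The
function name and the two dictionaries are hypothetical: file_list
stands for the (file name, SHA-256) pairs listed in a manifest, and
cache for whatever the RP actually managed to fetch.

```python
import hashlib


def check_manifest(file_list, cache):
    """Return (name, reason) for every manifest entry that cannot be used.

    file_list: dict of file name -> hex SHA-256 digest, as listed in the manifest
    cache:     dict of file name -> bytes the RP actually holds
    Both structures are assumptions made for this illustration.
    """
    problems = []
    for name, expected in sorted(file_list.items()):
        data = cache.get(name)
        if data is None:
            # The manifest names a file the RP never received.
            problems.append((name, "missing"))
        elif hashlib.sha256(data).hexdigest() != expected.lower():
            # The RP holds a file, but not the version the manifest describes.
            problems.append((name, "hash mismatch"))
    return problems
```

A non-empty result for a top-level manifest is the situation described
above: everything reachable only through that manifest is affected.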

The type of RRDP discrepancy I observed can happen if distinct HTTP
clients (aka RPKI validator instances) are served different data,
despite requesting the same URL. In other words: two HTTPS clients
requested rrdp.lacnic.net/abc/123.xml - one client received 123.xml with
content "ABC" but the other client received 123.xml with content "XYZ".
This can happen if multiple RRDP frontend servers are in play,
out-of-sync with each other.
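
A client can detect this kind of mismatch because the RRDP notification
file (RFC 8182) carries a SHA-256 hash for every delta it lists. The
sketch below, in Python, shows the comparison under simple assumptions:
the helper names are made up for this example, and fetching the
documents over HTTPS is left out.

```python
import hashlib
import xml.etree.ElementTree as ET

# XML namespace used by RRDP notification, snapshot and delta files (RFC 8182).
RRDP_NS = "{http://www.ripe.net/rpki/rrdp}"


def parse_notification(xml_text):
    """Return (session_id, serial, {delta serial: (uri, hash)})."""
    root = ET.fromstring(xml_text)
    deltas = {
        int(d.get("serial")): (d.get("uri"), d.get("hash").lower())
        for d in root.findall(RRDP_NS + "delta")
    }
    return root.get("session_id"), int(root.get("serial")), deltas


def delta_matches(notification_xml, serial, delta_bytes):
    """True if the fetched delta body has the hash the notification promised."""
    _, _, deltas = parse_notification(notification_xml)
    _, expected = deltas[serial]
    return hashlib.sha256(delta_bytes).hexdigest() == expected
```

If the notification file came from one frontend and the delta from
another that is out of sync, delta_matches() returns False even though
both URLs were fetched correctly.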

I'm concerned there might be an 'active/active' aspect in the
high-availability setup of LACNIC without proper synchronization
within the cluster itself. For example: if some kind of
'directory-to-RRDP' conversion process is executed on two (or more)
nodes, the nodes each should use a unique RRDP session ID, and a
load-balancer should apply active/backup distribution.
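
As a rough way to test for this from the outside, one could record what
each frontend advertises and look for the same (session ID, serial)
pair appearing with different delta hashes. The Python sketch below
assumes the observations have already been collected. The node labels
and the tuple layout are hypothetical and say nothing about LACNIC's
actual setup.

```python
from collections import defaultdict


def find_conflicts(observations):
    """Find deltas that were advertised with more than one hash.

    observations: iterable of (node, session_id, serial, delta_hash) tuples,
    one per delta entry seen in a notification file served by `node`.
    Returns {(session_id, serial): {delta_hash: [nodes...]}} for conflicts only.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for node, session_id, serial, digest in observations:
        seen[(session_id, serial)][digest.lower()].add(node)
    return {
        key: {digest: sorted(nodes) for digest, nodes in by_hash.items()}
        for key, by_hash in seen.items()
        if len(by_hash) > 1  # same session and serial, different content
    }
```

Nodes that publish under distinct session IDs never collide in this
check, which is the point of giving each node its own session ID.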

I'm happy to help investigate where exactly the issue resides to prevent
recurrence.

Kind regards,

Job