[lacnog] problems with LACNIC's RPKI TA
job at sobornost.net
Mon Apr 17 12:49:59 -03 2023
Hi Carlos, LACNOG,
On Sun, Apr 16, 2023 at 11:00:31AM -0300, Carlos Marcelo Martinez Cagnazzo wrote:
> There was indeed a problem that lasted approximately one hour, well
> within the tolerance of the objects, so those who were already
> running validators should not have had any difficulty using their
> local cache.
From my observations the outage lasted approximately two hours. And
unfortunately, for many - probably most - RPKI validator instances
around the world, their 'local cache' became entirely invalid, because
of internally inconsistent RRDP delta updates.
Many RPs saw a Manifest reference to a file to which they had no access.
The manifest in question is a 'top-level' manifest and basically is a
'gateway' to all the CA certificates representing every RPKI-enabled
LACNIC member. To see a recent copy in decoded form:
The type of RRDP discrepancy I observed can happen if distinct HTTP
clients (aka RPKI validator instances) are served different data,
despite requesting the same URL. In other words: two HTTPS clients
requested rrdp.lacnic.net/abc/123.xml - one client received 123.xml with
content "ABC" but the other client received 123.xml with content "XYZ".
This can happen if multiple RRDP frontend servers are in play,
out-of-sync with each other.
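
To make the failure mode concrete, here is a minimal sketch of the check
an RP can perform: the RRDP notification file lists a SHA-256 hash for
every delta, so a delta body served by an out-of-sync frontend will not
match the hash the RP was told to expect. This is my own illustration in
Python, not code from any particular validator; the function names are
hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

# XML namespace used by RRDP notification, snapshot and delta files.
RRDP_NS = "{http://www.ripe.net/rpki/rrdp}"


def delta_hashes(notification_xml):
    """Map each delta serial to the SHA-256 hex digest listed in the notification file."""
    root = ET.fromstring(notification_xml)
    return {
        int(delta.get("serial")): delta.get("hash").lower()
        for delta in root.findall(RRDP_NS + "delta")
    }


def delta_matches(notification_xml, serial, delta_bytes):
    """Return True only if the fetched delta body hashes to the announced value."""
    expected = delta_hashes(notification_xml).get(serial)
    if expected is None:
        # The notification file does not announce this serial at all.
        return False
    return hashlib.sha256(delta_bytes).hexdigest() == expected
```

With two frontends out of sync, the client that received "XYZ" for
123.xml would see delta_matches() return False when checked against a
notification file that announced the hash of "ABC".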
I'm concerned there might be an 'active/active' aspect in the
high-availability setup of LACNIC without proper synchronization
within the cluster itself. For example: if some kind of
'directory-to-RRDP' conversion process is executed on two (or more)
nodes, each node should use a unique RRDP session ID, and a
load-balancer should apply active/backup distribution.
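
The reason a unique session ID per node matters can be sketched from the
RP side: an RP tracks (session_id, serial), and a changed session_id
tells it to discard its delta history and fetch a full snapshot instead
of applying deltas from an unrelated publication history. The following
is a simplified illustration under my own assumptions; RrdpState and
next_action are hypothetical names, and treating a lower remote serial
as "fetch a snapshot" is a conservative choice of mine rather than
something I am quoting from a specification.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RrdpState:
    session_id: str
    serial: int


def next_action(local: Optional[RrdpState], remote: RrdpState,
                available_deltas: Iterable[int]) -> str:
    """Decide how an RP should synchronize, given its last known state."""
    # No prior state, or a different session: delta history is unusable.
    if local is None or local.session_id != remote.session_id:
        return "snapshot"
    if remote.serial == local.serial:
        return "up-to-date"
    # Assumption: a serial going backwards is treated as a reset.
    if remote.serial < local.serial:
        return "snapshot"
    # Deltas are usable only if every serial in between is announced.
    needed = set(range(local.serial + 1, remote.serial + 1))
    if needed <= set(available_deltas):
        return "deltas"
    return "snapshot"
```

If two nodes share one session ID but generate deltas independently, the
first check cannot fire, and the RP may apply deltas that do not belong
to the history it holds.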
I'm happy to help investigate where exactly the issue resides to prevent