How I rescued my postgresql database on fly.io


The problem #

I had an old fly.io Postgres database that I could no longer connect to, for whatever reason (maybe just its age), and it held data I wanted to rescue.

First, I tried to diagnose the database, and connect in many different ways to see if I could get at the data:

$ fly apps list
NAME           OWNER   STATUS   PLATFORM LATEST DEPLOY     
mydb           personal deployed machines                  
shy-field-1671 personal deployed machines                  

$ fly postgres connect -a mydb
Error: no active leader found

$ fly postgres config show -a mydb
Error: no active leader found

$ fly postgres failover -a mydb
Error: failover is not available for standalone postgres

$ fly postgres restart -a mydb
Error: no active leader found

$ fly status --app mydb
ID            STATE  ROLE   REGION CHECKS            IMAGE                                  CREATED             UPDATED              
148e271cdd0d89 started unknown ams   2 total, 2 passing flyio/postgres-standalone:14.1 (v0.0.7) 2023-01-18T20:24:44Z 2023-08-13T14:46:37Z

❯ fly machine status 148e271cdd0d89 --app mydb
Machine ID: 148e271cdd0d89
Instance ID: 01H7QKQNZD4ZBGHP8PSFPQYJYH
State: started

**VM**
  ID            = 148e271cdd0d89                           
  Instance ID   = 01H7QKQNZD4ZBGHP8PSFPQYJYH               
  State         = started                                  
  Image         = flyio/postgres-standalone:14.1 (v0.0.7)  
  Name          = young-river-6979                         
  Private IP    = fdaa:0:2072:a7b:23c8:2948:4c51:2         
  Region        = ams                                      
  Process Group = app                                      
  CPU Kind      = shared                                   
  vCPUs         = 1                                        
  Memory        = 256                                      
  Created       = 2023-01-18T20:24:44Z                     
  Updated       = 2023-08-13T14:46:37Z                     
  Entrypoint    =                                          
  Command       =                                          
  Volume        = vol_ke628r6pepjvwmnp                     

**Event Logs**
STATE   EVENT SOURCE TIMESTAMP                    INFO
started start flyd  2023-08-13T15:46:37.636+01:00
starting start user  2023-08-13T15:46:37.225+01:00
stopped exit flyd  2023-08-13T15:34:21.321+01:00 exit_code=0,oom_killed=false,requested_stop=true
stopping stop user  2023-08-13T15:34:19.823+01:00
started start flyd  2023-08-13T15:10:29.282+01:00

$ fly checks list --app mydb
Health Checks for mydb
  NAME | STATUS  | MACHINE        | LAST UPDATED     | OUTPUT
-------*---------*----------------*------------------*--------------------------------------------------------------------------
  pg   | passing | 148e271cdd0d89 | Jan 2 2024 19:39 | [✓] connections: 7 used, 3 reserved, 100 max (3.39ms)                    
-------*---------*----------------*------------------*--------------------------------------------------------------------------
  vm   | passing | 148e271cdd0d89 | Jan 2 2024 19:39 | [✓] checkDisk: 9.15 GB (93.9%) free space on /data/ (40.44µs)            
       |         |                |                  | [✓] checkLoad: load averages: 0.00 0.00 0.00 (62.78µs)                   
       |         |                |                  | [✓] memory: system spent 0s of the last 60s waiting on memory (31.44µs)  
       |         |                |                  | [✓] cpu: system spent 0s of the last 60s waiting on cpu (21.98µs)        
       |         |                |                  | [✓] io: system spent 0s of the last 60s waiting on io (21.26µs)          
-------*---------*----------------*------------------*--------------------------------------------------------------------------

So it LOOKS healthy... but this "no active leader" message is really bad. Yikes.

Could I try creating a new Postgres database from a snapshot?

❯ fly volumes snapshots list vol_ke628r6pepjvwmnp
**Snapshots**
ID                 STATUS SIZE    CREATED AT   
vs_ejDYvgm1z0GXYsL created 28480078 14 hours ago
vs_Yj5YwKj0mz1YRhzX created 28510195 1 day ago   
vs_lqM91DxPBnpl5uXo created 28514186 2 days ago  
vs_kZoR7j7XjMbJBh7m created 28503069 3 days ago  
vs_8x4zNOZ59lalXSZX created 28496591 4 days ago  
vs_YVyoj8nbayzgJS6N created 28505782 5 days ago  

❯ fly postgres create --snapshot-id vs_ejDYvgm1z0GXYsL
**?** **Choose an app name (leave blank to generate one):** mydb-restored-2024
automatically selected personal organization: Aaron Kelly
Some regions require a Launch plan or higher (bom, fra).
See https://fly.io/plans to set up a plan.
**?** **Select region:** Amsterdam, Netherlands (ams)
**?** **Select configuration:** Development - Single node, 1x shared CPU, 256MB RAM, 1GB disk
**?** **Scale single node pg to zero after one hour?** Yes
Creating postgres cluster in organization personal
Creating app...
Setting secrets on app mydb-restored-2024...
Restoring 1 of 1 machines with image flyio/postgres-standalone:14.1@sha256:ca27c53b81cae713e67d7ced87a4289961db4a81e382b09aaf42ea53032791eb
Error: failed to create volume: failed to create volume: Restore volume size must be at least 10GB (Request ID: 01HN5BS7FC83JWTB0PQ92WX2TK-ams)

No joy. The Development preset only provisions a 1GB disk, but restoring this snapshot needs a volume of at least 10GB.

How about forking a volume?

$ fly postgres create --fork-from mydb
**?** **Choose an app name (leave blank to generate one):** mydb-restored-2024
automatically selected personal organization: Aaron Kelly
Error: Failed to resolve the volume associated with the primary instance. See `fly pg create --help` for more information

$ fly volumes list --app mydb
ID                  STATE  NAME   SIZE REGION ZONE ENCRYPTED ATTACHED VM   CREATED AT   
vol_ke628r6pepjvwmnp created pg_data 10GB ams   ac80 false    148e271cdd0d89 2 years ago 
vol_d7xkrkynxpo4w2q9 created pg_data 10GB ord   bddf false                  5 months ago

$ fly postgres create --fork-from mydb:vol_d7xkrkynxpo4w2q9
**?** **Choose an app name (leave blank to generate one):** mydb-restored-2024
automatically selected personal organization: Aaron Kelly
**?** **Select VM size:** shared-cpu-1x - CPU Kind: Shared, vCPUs: 1 Memory: 256MB
Creating postgres cluster in organization personal
Creating app...
Setting secrets on app mydb-restored-2024...
Provisioning 1 of 1 machines with image flyio/postgres:14.6@sha256:3c25db96357a78e827ca7dbbf4963089bdfcd48fb05a40cd82cfff5c61fb7710
Waiting for machine to start...
Machine 784e662b66d068 is created
==> Monitoring health checks
  Waiting for **784e662b66d068** to become healthy (started, 3/3)

Postgres cluster mydb-restored-2024 created
  Username:    postgres
  Password:    UsvWa7FHbUhuuPn
  Hostname:    mydb-restored-2024.internal
  Proxy port:  5432
  Postgres port:  5433
  Connection string: postgres://postgres:UsvWa7FHbUhuuPn@mydb-restored-2024.internal:5432

_Save your credentials in a secure place -- you won't be able to see them again!_

**Connect to postgres**
Any app within the Aaron Kelly organization can connect to this Postgres using the above connection string

Now that you've set up Postgres, here's what you need to understand: https://fly.io/docs/postgres/getting-started/what-you-should-know/ 

$ fly pg connect --app mydb-restored-2024
Connecting to fdaa:0:2072:a7b:111:952:a3c7:2... complete
psql (14.6 (Debian 14.6-1.pgdg110+1))
Type "help" for help.

postgres=# \dt
Did not find any relations.

Ehhh not good. (In hindsight, vol_d7xkrkynxpo4w2q9 was the unattached ord volume, not vol_ke628r6pepjvwmnp, the one actually attached to the machine, so it probably never had my data on it.)

Looking at the logs on https://fly.io/apps/mydb-restored-2024/monitoring :

2024-01-27T11:51:13.904 app[784e90dc440568] ams [info] exporter | INFO[0088] Established new database connection to "fdaa:0:2072:a7b:13b:1bf2:88ea:2:5432". source="postgres_exporter.go:970"

2024-01-27T11:51:13.910 app[784e90dc440568] ams [info] flypg | 2024-01-27 11:51:13.908 UTC [537] FATAL: password authentication failed for user "flypgadmin"

2024-01-27T11:51:13.910 app[784e90dc440568] ams [info] flypg | 2024-01-27 11:51:13.908 UTC [537] DETAIL: Connection matched pg_hba.conf line 3: "host all all ::0/0 md5"

2024-01-27T11:51:14.445 app[784e90dc440568] ams [info] flypg | 2024-01-27 11:51:14.444 UTC [539] FATAL: password authentication failed for user "postgres"

2024-01-27T11:51:14.445 app[784e90dc440568] ams [info] flypg | 2024-01-27 11:51:14.444 UTC [539] DETAIL: Connection matched pg_hba.conf line 3: "host all all ::0/0 md5"

2024-01-27T11:51:14.445 app[784e90dc440568] ams [info] Failed to create required users: failed to connect to `host=fdaa:0:2072:a7b:13b:1bf2:88ea:2 user=postgres database=postgres`: failed SASL auth (FATAL: password authentication failed for user "postgres" (SQLSTATE 28P01))

Very sad.

The solution #

Oh well, so I suppose I'm now in data rescue mode.

Could I just ssh to the volume and then create a dump of the database?

Let's see if I can connect, and see my data:

fly ssh console --app mydb
Connecting to fdaa:0:2072:a7b:23c8:2948:4c51:2... complete
root@148e271cdd0d89:/# su postgres
postgres@148e271cdd0d89:/$ psql
psql (14.1 (Debian 14.1-1.pgdg110+1))

Type "help" for help.

postgres=# \d

                List of relations
 Schema |       Name        |   Type   |  Owner   
--------+-------------------+----------+----------
 public | appliances        | table    | postgres
 public | appliances_id_seq | sequence | postgres
 public | ketones           | table    | postgres
 public | ketones_id_seq    | sequence | postgres
 public | weight            | table    | postgres
 public | weight_id_seq     | sequence | postgres
(6 rows)

There it is! Now let's exit psql and take a dump of the database...

postgres-# \q

postgres@148e271cdd0d89:/$ cd
postgres@148e271cdd0d89:~$ pwd
/var/lib/postgresql
postgres@148e271cdd0d89:~$ pg_dump > mydb.sql
postgres@148e271cdd0d89:~$ cat mydb.sql 
--
-- PostgreSQL database dump
--

-- Dumped from database version 14.1 (Debian 14.1-1.pgdg110+1)
-- Dumped by pg_dump version 14.1 (Debian 14.1-1.pgdg110+1)

SET statement_timeout = 0;
SET lock_timeout = 0;

Awesome, now let's exit the container completely, and download it:

fly sftp get /var/lib/postgresql/mydb.sql --app mydb
5379 bytes written to mydb.sql

Using https://hub.docker.com/_/postgres/ as an example, I created the following docker-compose.yml :

# Use postgres/example user/password credentials
version: '3.1'

services:

  db:
    image: postgres
    restart: always
    environment:
      POSTGRES_PASSWORD: example
    volumes:
      - ./backups:/mnt

  adminer:
    image: adminer
    restart: always
    ports:
      - 8080:8080

I moved my database backup file into a new backups directory, as backups/mydb.sql
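For reference, that's just (assuming the dump from fly sftp get is in the current directory, and the ./backups mapping from the compose file above):

```shell
# Create the directory the compose file maps into the container at /mnt,
# and move the dump into it
mkdir -p backups
mv mydb.sql backups/mydb.sql
```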

I then connected to the database container and attempted to run the import:

postgres@f64d15e57bc5:/$ pg_restore -U postgres -d postgres < /mnt/mydb.sql 
pg_restore: error: input file appears to be a text format dump. Please use psql.

Oh. I would have thought the export defaults would have been ok. It turns out pg_dump defaults to a plain-text SQL script, which you replay with psql; pg_restore only handles the custom, directory and tar formats. So the import command is pretty simple:

postgres@f64d15e57bc5:/$ psql -d postgres -f /mnt/mydb.sql 
SET
SET
 set_config 
------------
(1 row)
CREATE TABLE
ALTER TABLE
CREATE SEQUENCE
...
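If you're ever unsure which tool a dump needs, the file itself tells you: custom-format dumps start with the magic bytes PGDMP, while plain-text dumps (the pg_dump default, as here) start with SQL comments. A quick check, assuming the dump file name from above:

```shell
# Custom-format dumps begin with the magic bytes "PGDMP";
# plain-text dumps begin with "--" SQL comments.
if [ "$(head -c 5 mydb.sql)" = "PGDMP" ]; then
  echo "custom format: restore with pg_restore"
else
  echo "plain SQL: restore with psql -f"
fi
```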

I checked Adminer on http://localhost:8080/, logged in with postgres and example...

and I could see my data, great success!

My original, embarrassing, hacky solution #

My first attempt involved copying the raw Postgres data directory off the volume and mounting it into a local container.

I don't know why I didn't think of looking for Postgres tools to help with data dumps, but at least I learned a lot about other stuff along the way.

What I did #

Let's try to take a backup of the Postgres data folder:

$ fly ssh console --app mydb
Connecting to fdaa:0:2072:a7b:23c8:2948:4c51:2... complete

root@148e271cdd0d89:~# cd
root@148e271cdd0d89:~#

root@148e271cdd0d89:/# tar cvf mydb_backup.tar /data/postgres
tar: Removing leading `/' from member names
/data/postgres/
/data/postgres/pg_serial/
/data/postgres/pg_replslot/
/data/postgres/postgresql.conf
/data/postgres/pg_stat_tmp/
/data/postgres/pg_stat_tmp/db_0.stat
...
/data/postgres/postmaster.opts
/data/postgres/pg_hba.conf

$ fly sftp get /root/mydb_backup.tar --app mydb
44103680 bytes written to mydb_backup.tar

Using https://hub.docker.com/_/postgres/ as an example, I created a folder with the following docker-compose.yml inside:

# Use postgres/example user/password credentials
version: '3.1'

services:

  db:
    image: postgres
    restart: always
    environment:
      POSTGRES_PASSWORD: example
    volumes:
      - ./database-backups/data/postgres:/var/lib/postgresql/data

  adminer:
    image: adminer
    restart: always
    ports:
      - 8080:8080

Inside that folder I then created a folder called database-backups, and moved mydb_backup.tar into it.

Next I ran this command to start the containers, and checked the logs to make sure the startup was ok:

$ docker compose up -d
$ docker compose logs
adminer-1  | [Sat Jan 27 13:11:05 2024] PHP 7.4.33 Development Server (http://[::]:8080) started
db-1       | The files belonging to this database system will be owned by user "postgres".
db-1       | This user must also own the server process.
db-1       | 
db-1       | The database cluster will be initialized with locale "en_US.utf8".
...
db-1       | 2024-01-27 13:11:08.526 UTC [1] LOG:  database system is ready to accept connections

I also checked that Adminer was up at http://localhost:8080/, which it was. I was able to log in with the default username postgres and password example, leaving the database field blank, and saw the default databases postgres, template0 and template1.

Happy with all that, I entered the postgres container:

$ docker compose exec --user postgres db bash

And checked the permissioning of the postgres data folder inside the container:

postgres@048cbcd015f6:/$ cd /var/lib/postgresql
postgres@048cbcd015f6:~$ ls -al
total 12
drwxr-xr-x  1 postgres postgres 4096 Jan 12 00:23 .
drwxr-xr-x  1 root     root     4096 Jan 12 00:23 ..
drwx------ 19 postgres postgres 4096 Jan 27 13:53 data

postgres@0a3201ad1f64:/$ id
uid=999(postgres) gid=999(postgres) groups=999(postgres),101(ssl-cert)

I exited the container, and then extracted my backup:

$ tar xvf mydb_backup.tar
x data/postgres/
x data/postgres/pg_serial/
x data/postgres/pg_replslot/
x data/postgres/postgresql.conf
...

I then chowned and chmodded the backup dir to match the ownership and permissions inside the container:

$ sudo chown -R 999:999 database-backups/data
$ sudo chmod -R 700 database-backups/data
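A quick sanity check after those two commands (paths from the steps above; 999 is the postgres uid/gid in the official image):

```shell
# The server refuses to start if the data dir is group/world accessible,
# so confirm mode 700 and postgres (999) ownership before `docker compose up`:
stat -c '%a %u:%g' database-backups/data/postgres
# expect: 700 999:999
```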

I then brought all the services up with docker compose up -d, and checked that the logs were ok with docker compose logs

I went back to Adminer, however, when trying to login, it gave me the message:

Unable to connect to PostgreSQL server: FATAL: password authentication failed for user "postgres"

I needed a way to connect without a password. So, I connected to the container as root, installed an editor, and edited this file:

$ docker compose exec -it  db bash
$ apt update && apt install vim
$ vim /var/lib/postgresql/data/pg_hba.conf

It had these contents:

local all postgres  trust
host all all 0.0.0.0/0 md5
host all all ::0/0 md5

I replaced all the md5 entries with trust :

local all postgres  trust
host all all 0.0.0.0/0 trust
host all all ::0/0 trust
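Installing vim just for this is overkill, by the way; a sed one-liner inside the container does the same edit (fine for a throwaway local container, never for anything network-reachable):

```shell
# Swap every trailing md5 auth method in pg_hba.conf for trust
sed -i 's/md5$/trust/' /var/lib/postgresql/data/pg_hba.conf
```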

Restarted the database:

pg_ctl restart

And I was able to log in with:

server: db
username: postgres
password: example

When I first attempted this method months ago, it was a success - I saw all my old data in the postgres db.

However, repeating these steps recently was not successful.

So, this solution is bad and very hacky.