[New servers] - Migrate data from Knopi to Ramashka #512

Closed
opened 2023-05-03 09:59:04 +02:00 by muppeth · 12 comments
Owner
No description provided.
muppeth added this to the 23.05 - May milestone 2023-05-03 09:59:04 +02:00
muppeth self-assigned this 2023-05-03 09:59:04 +02:00
muppeth added this to the 23.05 - May project 2023-05-05 22:25:22 +02:00
Author
Owner

There are two major filestores that need to be migrated: email and Nextcloud.

Email

Estimated time of migration is about 4 days. We are rsyncing each individual mbox to speed up the calculation process, and we sync mboxes in parallel (64 at a time) to make things go faster. Because the initial sync takes 4 days, the mboxes copied first will be out of date by the time the last ones finish. This may not be much of a problem, as a second sync after the initial migration should take less time (we still need to test that). If the re-sync does turn out to take long and we want to avoid mbox state discrepancies, we could apply the per-user idea from the Nextcloud sync described below.
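A minimal sketch of the per-mbox parallel sync, not the script actually used. The paths, the `sync_all` and `rsync_command` names, and the choice of rsync flags (`-a`, `--delete`) are assumptions for illustration; only the "one rsync per mbox, 64 at a time" idea comes from the comment above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def rsync_command(mailbox: Path, dest_root: str) -> list[str]:
    """Build the rsync call for one mailbox directory (paths are hypothetical)."""
    # -a preserves permissions and timestamps; --delete removes mail on the
    # destination that no longer exists at the source.
    return ["rsync", "-a", "--delete", f"{mailbox}/", f"{dest_root}/{mailbox.name}/"]


def sync_all(mailboxes, dest_root, workers=64, run=subprocess.run):
    """Sync every mailbox, running up to `workers` rsync processes at once.

    Returns a dict mapping mailbox name to rsync's exit code.
    """
    def sync_one(mailbox):
        result = run(rsync_command(mailbox, dest_root))
        return mailbox.name, result.returncode

    # Threads are enough here: each worker just waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(sync_one, mailboxes))
```

Any mailbox with a non-zero exit code in the returned dict would need another pass before the final switch.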

Nextcloud

The Nextcloud file store is huge. On top of that, it is tightly coupled to the database, because the filecache is stored there. Combined with server-side per-user key encryption, this is very prone to problems. From our experience, this is a very fragile and delicate matter. For example:
Some years ago we had a situation where a user accidentally wiped his Nextcloud storage by removing a directory from the Nextcloud sync client. We had the data (files) in the backup, so we wanted to test how easy it would be to restore. It turned out to be rather a disaster. We had the data and could restore it, but all the keys and filecache entries in the db had been removed. We had a daily db backup, but the user reported the issue too late and we could not recover the db state from the day the files vanished. It took us days to recover only a portion of the data.
Moving the datastore to another server brings up some of the same problems. There are two ways to do the move.

  • With long downtime - We could do a sync similar to the mailbox one in terms of procedure: move files per user in parallel, measure the time, repeat. After at least two runs this would give us an idea of how long we would need to take Nextcloud offline (or at least disable the files app) before we can do the last move and re-enable everything.

  • With no or minimal per user downtime - This approach would require both new and old datastores mounted on the cloud server. Then we should commence the following per user:

    • sync data from old to new filestore
    • disable user account (preventing the user from uploading new content)
    • re-sync data from old to new filestore
    • bind mount user files from new filestore onto old one
    • enable user account
      This procedure would mean that although each user can experience a short downtime (the time it takes to re-sync their data, which can be minutes), the service as a whole keeps working without disturbance for everyone else. However, we do need to consider a few things (and probably quite a few more that we have not thought of yet).
    • Users whose datastore has been migrated need to be marked as done, to make sure no further sync is run for them.
    • If the server needs to reboot (for whatever reason), all bind mounts must be in place before Nextcloud comes back online (this means before the fpm process and the cronjob start).
    • We need to test the scenario further to see if it is even doable (can we create a couple of thousand bind mounts? will symlinking work better? what do we do in case of disaster, where we need to revert the process halfway because we did not anticipate something?)
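The per-user procedure above could be sketched roughly as follows. This is an illustration under assumptions, not a tested migration script: the storage paths, the `migration_steps` and `migrate_user` names, and the in-memory `done` set are hypothetical, and `php occ` is assumed to be run from the Nextcloud directory as the web server user. It does not handle rollback, so a failure after the disable step would leave the user disabled.

```python
import subprocess


def run_checked(cmd):
    """Run a command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


def migration_steps(user: str, old_root: str, new_root: str) -> list[list[str]]:
    """Return the ordered commands for moving one user's files."""
    old, new = f"{old_root}/{user}", f"{new_root}/{user}"
    sync = ["rsync", "-a", "--delete", f"{old}/", f"{new}/"]
    return [
        sync,                                  # 1. bulk sync while the user is still active
        ["php", "occ", "user:disable", user],  # 2. stop new uploads
        sync,                                  # 3. catch up on changes since step 1
        ["mount", "--bind", new, old],         # 4. expose new storage at the old path
        ["php", "occ", "user:enable", user],   # 5. let the user back in
    ]


def migrate_user(user, old_root, new_root, done, run=run_checked):
    """Migrate one user unless already marked done; return True if work was done."""
    if user in done:
        return False  # never sync over an already-migrated user
    for cmd in migration_steps(user, old_root, new_root):
        run(cmd)
    done.add(user)  # in practice this marker would need to survive a reboot
    return True
```

Keeping the steps as plain command lists makes it easy to dry-run the sequence (pass `run=print`) before touching production data.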

In both cases we should start testing the migration to see how long it will take. For the second option, we need to do a lot of tests before we commence this operation in production.

muppeth modified the milestone from 23.05 - May to 23.06 - June 2023-06-07 10:10:20 +02:00
Owner

Sounds like a pain in the butt!
Obviously the second proposal for NC sounds great: if a user has to go no more than a day without using his/her NC, that would be awesome.

Concerning email, does that mean that there is no stop of the service?

muppeth removed this from the 23.05 - May project 2023-06-11 10:49:10 +02:00
Author
Owner

I have started the migration of mailboxes. It goes pretty fast. In two or three days the initial sync should be complete, and then we can run and measure the secondary syncs. This will give us insight into how long the final sync could take. Once we know this, we can make a decision on how to do the switch from one filestore to the other.

muppeth modified the milestone from 23.06 - June to 23.07 - July 2023-07-03 22:27:51 +02:00
Author
Owner

Mailbox migration is done. Initial sync is complete. Re-syncing now takes about 5 hours. We should get it down to just 2-3 hours.
So this weekend (Saturday the 15th) we should be able to migrate to the new server.

muppeth removed this from the 23.07 - July milestone 2023-08-06 15:35:17 +02:00
Author
Owner

Time to get back to the Nextcloud migration.

muppeth added this to the 23.10 - October milestone 2023-10-01 14:55:47 +02:00
Author
Owner

Initial sync of nextcloud data has started.

muppeth modified the milestone from 23.10 - October to 23.11 - November 2023-10-30 22:35:16 +01:00
Author
Owner

The second sync after the initial one took about 11.5 hours. I think this can be lowered by running the sync in a loop for a few days. Once we have a more realistic figure for how long the last sync will take, we can decide how to do it. I think there are three options, though the third one seems like overkill at this moment and could cause a ton of issues, so I wouldn't even take it into consideration:

  1. Disable entire Nextcloud for the duration of the last migration
  2. Disable only files for the duration of the migration
  3. Create symlinks or bind mounts for each user, switching to new storage on per user basis.

The reason this is a rather delicate operation is mainly that the files are encrypted, and any inconsistency between database, keys and files could leave files undecryptable. Having had to deal with such issues in the past, I would rather go for option 1 or 2, as it will be stressful as is.

Probably the safest bet is disabling the entire Nextcloud for the duration of the move, but that could cause several hours of downtime. Disabling only the files app could be a good compromise, as only files would be affected. However, we need to make sure no cleanup cronjob or anything like that would trigger changes to the filecache and other file-related tables in the database.
In the end it all depends on how long the last sync will take. So let's wait a week or so to get a better idea.
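Measuring the repeated syncs could look something like the sketch below. It assumes the sync itself is wrapped in some callable; `timed_runs` and its parameters are hypothetical names, not part of any existing tooling.

```python
import time


def timed_runs(sync, rounds, clock=time.monotonic):
    """Call `sync` `rounds` times in a row and return each run's duration in seconds."""
    durations = []
    for _ in range(rounds):
        start = clock()
        sync()
        durations.append(clock() - start)
    return durations
```

If the durations level off after a few rounds, the last values give a reasonable estimate of the downtime needed for the final sync.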

Owner

I also think that "safest bet is disabling entire nextcloud for the duration of the move" even if it takes several hours. I prefer that rather than the issues the other options could generate.

Owner

I agree too. As long as we warn users about it, I guess it is something they can understand.

Author
Owner

Got it down to 7.20h. I think we could get it down to 5 easily, maybe further. I will do some more tweaking.

muppeth modified the milestone from 23.11 - November to 23.12 - December 2023-12-03 13:56:28 +01:00
Author
Owner

We are almost ready, but at the last meeting we decided to move it to January to avoid downtime around the end of the year, when people would rather have access to their data. It's generally not the best period to do this kind of stuff.

muppeth modified the milestone from 23.12 - December to 24.01 - January 2023-12-17 00:11:00 +01:00
muppeth modified the milestone from 24.01 - January to 24.02 - February 2024-02-06 16:31:43 +01:00
muppeth removed their assignment 2024-02-06 16:39:39 +01:00
muppeth added this to the (deleted) project 2024-02-15 01:06:05 +01:00
muppeth added the Bare metal label 2024-03-03 12:04:32 +01:00
muppeth modified the milestone from 24.02 - February to 24.03 - March 2024-03-03 16:26:42 +01:00
Author
Owner

Ufffff.... finally. Looks like things are done!
I will keep Knopi in the rack for a few more days to make sure things are doing fine. Once we think it's ok to take it down, we can hang Simo in its place.

muppeth self-assigned this 2024-03-25 00:08:17 +01:00
Reference: Disroot/Disroot-Project#512