[x] announce migration window to: ops list, wikitech-l list, Slack
[x] schedule downtime for phab1001 and all services on it, via cookbook:
  [cumin2002:~] $ sudo cookbook sre.hosts.downtime -D 14 -r 'T322250' phab1001.eqiad.wmnet
[x] confirm downtime is active in Icinga web UI (https://icinga.wikimedia.org)
[x] disable puppet on phab1001:
  sudo disable-puppet 'T280597'
[x] stop Apache, PHP-FPM and phd on phab1001
  [phab1001:~] sudo systemctl stop apache2
  [phab1001:~] sudo systemctl stop php7.3-fpm
  [phab1001:~] sudo systemctl stop phd
[x] confirm there are no more PHP processes running
  [phab1001:~] sudo ps aux | grep php
[x] rsync the /srv/repos diff by pulling on phab1004 from phab1001:
  [phab1004:/] (as root) rsync -avp --bwlimit=2m --delete rsync://phab1001.eqiad.wmnet/srv-repos/ /srv/repos/
[x] check on phab1004 whether any files under /srv/repos are owned by UID 497 (vcs); if so, give them to user phd
  [phab1004:/] find /srv/repos -uid 497
  [phab1004:/] find /srv/repos -uid 497 -exec chown phd {} \;
  - find proved far too slow on a fresh rsync of the repos data. We used chown -R phd:phd instead, accepting that everything is now phd:phd rather than a mix of phd:phd and phd:www-data
[x] check on phab1004 whether any files under /srv/repos are owned by GID 498 (aphlict); if so, give them to group phd
  [phab1004:/] find /srv/repos -gid 498
  [phab1004:/] find /srv/repos -gid 498 -exec chgrp phd {} \;
  - as above: find proved far too slow, so chown -R phd:phd was used instead
[x] check on phab1004 whether any files under /srv/repos are owned by a user that is NOT phd
  [phab1004:/] find /srv/repos ! -user phd
[x] expect this to show the PHEX repo but nothing else.
decide what to do with PHEX (root-owned)
  - Decision here: only some of the files under it were root-owned, which was most likely an artifact of a manual operation on phab1001
[x] output the full tree of /srv/repos and compare the number of directories / files between both servers
  [phab1001:/] tree -upfg > /root/repos-tree (this file will be just under 500MB of text)
  [phab1001:/] tail /root/repos-tree
  [phab1004:/] tree -upfg > /root/repos-tree
  [phab1004:/] tail /root/repos-tree
[] optional, if not satisfied yet: copy the result file from the old server to the new server (scp -3 ...) and run an actual diff between them
[x] set mysql ports for master and slave, specifically for eqiad (currently this happens in codfw but not in common hiera)
  merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/859145
  run-puppet-agent, check what happens on phab1004
[x] merge re-revert of the phabricator server name in common Hiera, run puppet, watch the changes on phab1004 and phab2002
  https://gerrit.wikimedia.org/r/c/operations/puppet/+/860031
[x] run a scap deploy to phab1004 (insert command, deployment server name)
[x] enable phd service on phab1004
  merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/859628 and run-puppet-agent
[x] wait a couple of minutes and check phd is still running (how long?) (if puppet kills it for any reason, that will happen on every puppet run...)
[x] merge re-revert of the DNS/SPF change https://gerrit.wikimedia.org/r/c/operations/dns/+/860032 and run "authdns-update" on ns0.wikimedia.org, which syncs to the other DNS servers
[x] wait about a minute and optionally use "dig phabricator.discovery.wmnet @ns0.wikimedia.org" to see it change from an alias for phab1001 to an alias for phab1004
[x] informational: dumps don't need to switch, they are already on phab1004, this has happened before
[x] informational: stats emails don't need to switch, they are already on phab1004, this has happened before

testing

[x] check https://phabricator.wikimedia.org works, watch out for yellow exclamation marks / warnings for admins
[x] test aphlict works by moving something on a workboard while someone else watches
[x] test if a ticket update shows up on IRC
[x] test if email from a ticket update arrives (by a user who has email notifications)
[x] check phabricator logs for exceptions (that aren't usual noise) (insert command / paths)
[x] test if CI works / "recheck" on a change in Gerrit

finalizing

[] merge patch to disable phd (and apache and php-fpm) on phab1001?
[x] verify proper monitoring downtime on phab1001
[x] reply to list emails and Slack that the migration is done successfully, link to the ticket in case they see any issues
[x] publish fingerprints on wikitech page after migration is done and grace period (how long?):
[x] double check which settings can move to common Hiera, remove setting from hosts files in Hiera
[] merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/824412 and check puppet run
[] remove phab1001 from mysql grants, coordinate with DBA on merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/858419
[x] create decom ticket for phab1001 - https://phabricator.wikimedia.org/T323418
[x] remove production puppet role from phab1001, merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/824804
[x] run decom cookbook from a cumin host on phab1001
  [cumin2002:~] $ sudo cookbook sre.hosts.decommission phab1001.eqiad.wmnet -t T323418
[x] remove phab1001 from site.pp https://gerrit.wikimedia.org/r/c/operations/puppet/+/858421
[x] check all the SRE boxes on decom ticket, assign to dcops in eqiad, add dcops tag
[x] resolve https://phabricator.wikimedia.org/T280597
[x] set OKR to 100% in Betterworks, profit
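
For the tree comparison step earlier in this checklist, comparing the two `tail` outputs by eye could be replaced by something like the sketch below. It assumes both listings have been copied to one host and that each ends with the usual `tree` summary line ("N directories, M files"). `compare_tree_summaries` is a hypothetical helper name, not an existing tool, and the file names in the usage line are placeholders.

```shell
#!/bin/sh
# Sketch only: compare the summary lines of two saved `tree` listings.
# compare_tree_summaries is a hypothetical helper, not an existing tool.
compare_tree_summaries() {
    old_listing=$1
    new_listing=$2

    # Assumption: the last line of each file is the tree summary,
    # e.g. "12 directories, 340 files".
    old_summary=$(tail -n 1 "$old_listing")
    new_summary=$(tail -n 1 "$new_listing")

    printf 'old: %s\n' "$old_summary"
    printf 'new: %s\n' "$new_summary"

    if [ "$old_summary" = "$new_summary" ]; then
        echo "summaries match"
        return 0
    fi
    echo "summaries differ"
    return 1
}
```

Usage might look like `compare_tree_summaries /root/repos-tree.phab1001 /root/repos-tree`, where the first file is the copy from the old server. Only the counts are compared here. A full `diff` of the listings is expected to be noisy, because `tree -upfg` prints owner and group and ownership was deliberately changed to phd:phd on phab1004.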