** Run all the following as the analytics user. ** 0) Deploy new version of refinery as per https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Deploy/Refinery-source 00) Make sure we have a backup! :) 000) Make sure we have kinit-ed 0000) As per (https://phabricator.wikimedia.org/T314147#8155713), we learnt we need to use Spark3 to avoid a parquet bug, so first we set spark3 on the current session: export SPARK_HOME=/usr/lib/airflow/lib/python3.7/site-packages/pyspark export SPARK_CONF_DIR=/etc/spark3/conf 1) open $SPARK_HOME/bin/sqark-sql and run: ALTER TABLE `wmf`.`editors_daily` CHANGE COLUMN user_fingerprint_or_id user_fingerprint_or_name STRING COMMENT 'If an anonymous user, this is a hash of the IP + UA, otherwise it is their global username across wiki dbs' ; now run: show partitions wmf.editors_daily; exit; example output: spark-sql (default)> show partitions wmf.editors_daily; partition month=2022-06 month=2022-07 2) For each month partition shown above, run the following HQL that will INSERT OVERWRITE the partition with valid data. Each run should take 2-3 minutes: $SPARK_HOME/bin/spark-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 \ -f editors_daily_monthly.hql \ -d refinery_hive_jar=hdfs://analytics-hadoop/wmf/refinery/current/artifacts/refinery-hive-shaded.jar \ -d source_table=wmf_raw.mediawiki_private_cu_changes \ -d user_history_table=wmf.mediawiki_user_history \ -d destination_table=wmf.editors_daily \ -d month=2022-06 \ -d coalesce_partitions=1 3) Now let's create the new monthly table: $SPARK_HOME/bin/spark-sql -f create_editors_by_country_monthly_table.hql.hql \ --database wmf Finally, run the following for the same partition sets we found above in step (1) to populate the new editors_by_country_monthly_table table. Each run should take 2-3 minutes: $SPARK_HOME/bin/spark-sql --master yarn --executor-memory 8G --executor-cores 4 --driver-memory 2G --conf spark.dynamicAllocation.maxExecutors=64 \ -f unique_editors_by_country_monthly.hql \ -d source_table=wmf.editors_daily \ -d destination_table=wmf.unique_editors_by_country_monthly \ -d month=2022-06 \ -d coalesce_partitions=1