Redshift

Redshift Best Practices

  • Best practices for data loading:
  • Larger node types load data faster (more slices loading in parallel)
  • 1 file in S3 per slice (slices are the parallel processing units within each Redshift node)
  • Compress the files, e.g. with gzip
  • File size: 1 MB to 1 GB compressed
  • COPY from S3 is the fastest option
  • COPY from EMR HDFS may be faster, but most people store their data in S3 rather than in HDFS
  • The first column of a compound SORTKEY should not be compressed
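The loading guidance above can be sketched as a single COPY command. The table, bucket, prefix, and IAM role below are hypothetical; on a cluster with N slices, the data would ideally be split into N gzip files matching the prefix so every slice loads one file in parallel:

```sql
-- Hypothetical example: load gzip-compressed, pipe-delimited files from S3.
-- Files are named part_00.gz, part_01.gz, ... so the prefix matches all of them.
COPY sales_staging
FROM 's3://my-bucket/sales/part_'                          -- placeholder bucket/prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' -- placeholder role
GZIP                                                       -- files are gzip-compressed
DELIMITER '|';
```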

  • Workflow: load into a staging table, then move the data into the production table

  • Wrap the entire workflow in ONE transaction
  • COMMITs are very expensive in Redshift
  • Disable automatic statistics and compression analysis on staging tables (STATUPDATE OFF, COMPUPDATE OFF on the COPY)
  • Make sure the distribution keys match between the staging and production tables
  • Compress your staging tables
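A minimal sketch of the staging-to-production workflow above, assuming hypothetical sales_staging and sales tables that share the same DISTKEY. Everything runs inside one transaction with a single COMMIT:

```sql
BEGIN;

-- Load into staging; skip statistics and compression analysis on the staging table
COPY sales_staging
FROM 's3://my-bucket/sales/part_'                          -- placeholder bucket/prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' -- placeholder role
GZIP DELIMITER '|'
STATUPDATE OFF COMPUPDATE OFF;

-- Upsert: delete the rows being replaced, then insert the fresh ones
DELETE FROM sales
USING sales_staging
WHERE sales.id = sales_staging.id;                         -- placeholder join key

INSERT INTO sales SELECT * FROM sales_staging;

COMMIT;   -- the single, expensive commit

-- TRUNCATE commits implicitly in Redshift, so clear staging outside the transaction
TRUNCATE sales_staging;
```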

  • Run ANALYZE after VACUUM: VACUUM reclaims space and re-sorts rows, ANALYZE refreshes the statistics the query planner relies on
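The last point as commands (the table name is a placeholder):

```sql
VACUUM FULL sales;   -- reclaim space from deleted rows and re-sort the table
ANALYZE sales;       -- refresh planner statistics after the vacuum
```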