
Running Graphite on EC2

2013-03-16 00:00:00 -0700



Graphite is one of the first-line troubleshooting tools for Shokunin's clients, and most of them run it on Amazon EC2. Through trial and error, we have established a few best practices for setting it up in the cloud.



Disk Layout



While Graphite is very good at minimizing I/O issues, with enough stats the disk is the first thing to become a bottleneck.

  • Create several volumes and partition data across them according to some grouping
  • Use provisioned IOPS EBS volumes for better I/O
  • Use the ext4 file system for better performance
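As a rough illustration of "partition data according to some grouping", the sketch below maps the first component of a metric name to a volume. The `infra` prefix comes from the collectd config in this post; the `apps` group, the default volume and the function names are assumptions made up for the example, not part of Graphite.

```python
import posixpath

# Hypothetical grouping: first metric path component -> mount point.
# "infra" matches the collectd Prefix used in this post; "apps" is invented.
VOLUME_BY_GROUP = {
    "infra": "/graphite_data/01",
    "apps": "/graphite_data/02",
}
# Assumption: anything that matches no group lands on the first volume.
DEFAULT_VOLUME = "/graphite_data/01"


def volume_for(metric):
    """Return the volume assigned to the metric's top-level group."""
    group = metric.split(".", 1)[0]
    return VOLUME_BY_GROUP.get(group, DEFAULT_VOLUME)


def whisper_path(metric):
    """Build the .wsp path a metric would have under its volume."""
    return posixpath.join(volume_for(metric), *metric.split(".")) + ".wsp"
```

Grouping by prefix keeps every metric of a group on one volume, which makes it easy to see which group is responsible when a volume runs hot.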




Data Collection



Carbon is written in Python, and Python's GIL effectively limits each data collection (carbon-cache) process to a single core. Run more than one cache process so that you do not max out a single core, and have each cache process write to a separate EBS volume.





Sample config file:


################################################
#       Puppet Controlled
#   /opt/graphite/conf/carbon.conf
################################################
# Don't send data to the default port (2003). Keep it open to catch
# misconfigured clients so they can be fixed later.
[cache]
MAX_CACHE_SIZE = inf
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CACHE_HITS = False
WHISPER_AUTOFLUSH = False
###############################################
#
[cache:01]
STORAGE_DIR = /graphite_data/01
LOCAL_DATA_DIR = /graphite_data/01
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 1000
MAX_CREATES_PER_MINUTE = 50
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2013
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2013
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7012
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CACHE_HITS = False
WHISPER_AUTOFLUSH = False
###############################################
[cache:02]
STORAGE_DIR = /graphite_data/02
LOCAL_DATA_DIR = /graphite_data/02
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 1000
MAX_CREATES_PER_MINUTE = 50
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7022
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CACHE_HITS = False
WHISPER_AUTOFLUSH = False
###############################################
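The instance sections above follow a regular pattern: instance NN writes to /graphite_data/NN, and each port is the default plus 10 times the instance number. A minimal sketch of generating the lines that differ per instance is shown below; the function names are hypothetical, and the pattern is inferred from the two sample sections rather than required by Carbon.

```python
def instance_ports(n):
    """Ports for cache instance n, using the +10-per-instance pattern
    seen in the sample config (an inferred convention, not a Carbon rule)."""
    offset = 10 * n
    return {
        "LINE_RECEIVER_PORT": 2003 + offset,
        "UDP_RECEIVER_PORT": 2003 + offset,
        "PICKLE_RECEIVER_PORT": 2004 + offset,
        "CACHE_QUERY_PORT": 7002 + offset,
    }


def instance_section(n, data_root="/graphite_data"):
    """Render the instance-specific lines of a [cache:NN] section."""
    name = "%02d" % n
    lines = [
        "[cache:%s]" % name,
        "STORAGE_DIR = %s/%s" % (data_root, name),
        "LOCAL_DATA_DIR = %s/%s" % (data_root, name),
    ]
    for key, port in sorted(instance_ports(n).items()):
        lines.append("%s = %d" % (key, port))
    return "\n".join(lines)


if __name__ == "__main__":
    # Print skeleton sections for four instances.
    for n in (1, 2, 3, 4):
        print(instance_section(n))
        print()
```

The same arithmetic translates directly into a Puppet or Chef template if you would rather keep the whole file under configuration management.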

You will need to modify local_settings.py to make the web app aware of the new storage locations by adding the following:

# /opt/graphite/webapp/graphite/local_settings.py
# One entry per cache storage directory (only 01 and 02 appear in the sample carbon.conf)
STANDARD_DIRS = ['/graphite_data/01', 
                 '/graphite_data/02', 
                 '/graphite_data/03', 
                 '/graphite_data/04']

Use Collectd



While there are other collectors, we prefer collectd because it is lightweight compiled C, has plugins for all major infrastructure components (Apache, Nginx, MySQL, Redis, Java JMX), and makes it simple to write additional plugins.


Sample Base Graphite Collectd Config:

###############################################################
#               Puppet Controlled Default Template
###############################################################
FQDNLookup   false
LoadPlugin syslog
<Plugin syslog>
  LogLevel info
</Plugin>
LoadPlugin cpu
LoadPlugin disk
LoadPlugin interface
LoadPlugin memory
LoadPlugin network
LoadPlugin swap
LoadPlugin vmem
LoadPlugin write_graphite
<Plugin "write_graphite">
  <Carbon>
     Host "<%= graphite_server %>"
     Port "<%= graphite_port %>"
     Prefix "infra.<%= server_role %>."
     EscapeCharacter "_"
     StoreRates true
     AlwaysAppendDS false
  </Carbon>
</Plugin>
###################################
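Before pointing collectd at a cache instance, it can help to confirm that the instance accepts data. The sketch below sends one metric using Carbon's plaintext line protocol (metric path, value and Unix timestamp separated by spaces, one metric per line). The host name and metric path in the usage note are placeholders, and the helper names are made up for this example.

```python
import socket
import time


def format_line(path, value, timestamp=None):
    """Format one metric for Carbon's plaintext line receiver."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_metric(host, port, path, value, timeout=5.0):
    """Send a single metric line to a line receiver port over TCP."""
    line = format_line(path, value)
    sock = socket.create_connection((host, port), timeout=timeout)
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()
```

For example, `send_metric("graphite.example.com", 2013, "infra.test.smoke", 1)` would write a test point to the first cache instance, assuming that host name resolves to your Graphite server.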



Misc Tips



  • Don't use the relay feature: that process tends to use up CPU, and you start dropping updates. Use Puppet or Chef to spread your clients over the various ports instead.
  • When installing the web component, don't use SQLite; start with MySQL from the beginning.
  • Use aggregation when you want to roll up stats (see the Graphite docs).
  • Graphite installation notes for Ubuntu
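One way to spread clients over the cache ports without a relay is to derive the port from the host name, so the assignment is stable between configuration runs. This is only a sketch of the idea in Python; in practice the same logic would live in your Puppet or Chef code, and the function name and port list here are assumptions based on the sample carbon.conf.

```python
import zlib

# Line receiver ports of the cache instances in the sample carbon.conf.
CACHE_PORTS = [2013, 2023]


def port_for_host(hostname, ports=CACHE_PORTS):
    """Pick a stable port for a host from a checksum of its name.

    crc32 is used instead of the built-in hash() because hash() can
    differ between Python processes, which would reshuffle clients.
    """
    checksum = zlib.crc32(hostname.encode("utf-8")) & 0xFFFFFFFF
    return ports[checksum % len(ports)]
```

Because a host always maps to the same port, each of its metrics is written by exactly one cache instance and ends up on one volume. Note that adding a port to the list changes the mapping for some existing hosts.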