Saturday, October 31, 2015

Scaling Postgres Meetup at IBM



Summary

This was a *really interesting* meetup with two presentations:
  • one of the Postgres committers (Andres Freund) presenting the internals of Postgres, recent changes ("major recent scalability improvements (9.2, 9.5, 9.6) that have improved postgres' scalability massively") and some of the ongoing challenges
  • the second presentation, by Marco Slot (Citus Data), was on pg_shard ("seamlessly distribute a table across many servers for horizontal scale and replicate it for high availability")

Some of the highlights from the presentations:
  • Discussion of UMA and NUMA architectures and changes to Postgres internals to work better with large modern Intel servers (multiple cores and sockets) … the key here is that you want to ensure memory access is local (i.e. the memory buffer is directly attached to that core/socket). The Postgres database is significantly different from Oracle, e.g. in Oracle the users belong to the database, whereas on a Postgres server you might be running multiple databases but with common users.
  • Improvements to the locking implementation to better match the NUMA architecture. Andres started with a quick overview of the different levels of locking within Postgres: level-one (internal spin lock … very fast but minimal functionality), level-two (internal light-weight lock … some basic functionality: can be acquired in read mode by multiple threads, with error recovery) and level-three (heavy-weight lock … full functionality … the sort of locks a DBA inspects via pg_locks). Andres then went on to explain how moving the heavy-weight end-user locks from the global Postgres server level to the more local database level is a better fit for modern NUMA architectures. He also went through some results from pgbench tests he had run … basically his tests scaled pretty well up to several hundred concurrent sessions, and under extreme load the total throughput didn't significantly drop (as it did in earlier Postgres releases).
  • The pg_shard presentation was also interesting. This is a relatively young open-source project, but there was a nice demo running over multiple (4?) AWS instances … the sharding principles looked similar to me to Mongo sharding.


There was also some good Q&A after the presentations:
  • One of the current scalability limitations of Postgres is the buffer cache, which still uses the older "clock-sweep algorithm". Interestingly, there was a release where this algorithm was replaced with something more modern/efficient (the clock-sweep algorithm dates from 1979). Unfortunately this release had to be pulled as it turned out that it infringed on IBM patents :)
  • There was an interesting discussion during Q&A regarding whether you should have a buffer cache bigger than 8G (presumably due to the performance limitations of the clock-sweep algorithm). The response from Andres was that this is highly dependent on workload.
  • After the meetup I did a bit of googling regarding the clock-sweep algorithm and found some more details: "Inside the Buffer Cache" (see link below)
  • There was also some discussion of whether pgbouncer (a connection-pooling utility) should be moved into Postgres core.

Links:




Lastly this is a photo of me making notes at this meetup:

A couple of useful looking resources I found after the presentation:

Notes (from presentation … so not necessarily 100% accurate)


Introduction


SplendidData - PostgresPURE: (i) own Postgres Distro and (ii) Advanced Migration (Oracle>Postgres)
Reiner Peterke (performance analysis tooling) … presentation on hold (as we have two speakers)

Andres Freund

Vertical Scaling
- 2005 - 2 cores (x86)
- 2015 - 18 cores + multiple sockets
- Why scale up on a single machine?

Architecture considerations:
a) UMA
- single bus between memory and CPU(s)
- simpler but doesn't scale nicely for modern x86 servers with multiple sockets, each with multiple cores
b) NUMA (Non-Uniform Memory Access)
- each CPU has own memory
- accessing local memory is cheap, but accessing remote memory is more expensive

Postgres lock primer
l1 - spin lock … very fast to acquire, exclusive mode only, no queuing (cpu spins), …
l2 - light-weight lock … can be acquired in read mode by multiple threads, with error recovery
l3 - heavy-weight lock … as per pg_locks, the only locks seen by end-user DBAs, more complex, error recovery, dynamic identities, deadlock checks
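The level-one lock above is the simplest of the three: no queue, no error recovery, the CPU just spins until it wins. A toy Python sketch of that idea (not Postgres internals — it borrows a non-blocking `threading.Lock` acquire as the atomic test-and-set primitive):

```python
import threading
import time

class SpinLock:
    """Toy spin lock: exclusive mode only, no queue -- waiters just spin."""
    def __init__(self):
        self._flag = threading.Lock()

    def acquire(self):
        # Busy-wait until the atomic test-and-set succeeds.
        while not self._flag.acquire(blocking=False):
            time.sleep(0)  # yield the GIL; a real spinlock would issue a CPU "pause" hint

    def release(self):
        self._flag.release()

# Usage: guard a shared counter with the spin lock.
lock = SpinLock()
counter = 0

def work():
    global counter
    for _ in range(1000):
        lock.acquire()
        counter += 1
        lock.release()

threads = [threading.Thread(target=work) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 2000 -- no increments lost
```

The trade-off is exactly what the notes say: very fast to acquire when uncontended, but wasteful under contention since waiters burn CPU instead of queuing.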



Acquiring locks
- can be a bottleneck in earlier versions of postgres
- most locks don’t conflict, so store lock details locally, which is better for NUMA architecture (aka “fastpath”)

testing concurrent clients
- readonly pgbench scale 300
- ec2 m4.8xlarge
- new locking method is better for over 8 clients

perf top -az
- s_lock (spinlock acquire) taking 90% of CPU time

atomic operations
- https://en.wikipedia.org/wiki/Linearizability : ‘In concurrent programming, an operation (or set of operations) is atomic, linearizable, indivisible or uninterruptible if it appears to the rest of the system to occur instantaneously. Atomicity is a guarantee of isolation from concurrent processes.’)
- add & subtract, compare exchange
- 20 +1 operations, don’t want to lose any updates
- lwlock pgbench now scales better up to 64 clients / 400,000 TPS
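The lost-update problem behind those concurrent +1 operations can be sketched with a compare-exchange retry loop (toy Python; real hardware does the compare-and-swap in a single instruction, here a lock only simulates that atomicity):

```python
import threading

class AtomicCounter:
    """Toy CAS-based counter; the guard lock stands in for a hardware CMPXCHG."""
    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()

    def compare_exchange(self, expected, new):
        # Atomically: if value == expected, set it to new and report success.
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

    def increment(self):
        # Retry loop: re-read and retry until our compare-exchange wins.
        while True:
            old = self._value
            if self.compare_exchange(old, old + 1):
                return

counter = AtomicCounter()
threads = [threading.Thread(target=lambda: [counter.increment() for _ in range(1000)])
           for _ in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(counter._value)  # 8000 -- every +1 survives
```

A plain unsynchronized read-modify-write would let two threads read the same old value and overwrite each other; the compare-exchange detects the conflict and retries instead.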

levels of caching
- traditional disks are super slow
- even ssd are much slower than main memory
- 8k pages by default
- postgres has inefficient buffer replacement (‘clock-sweep algorithm’, based on a paper from 1979?)
- struct Buffer … usage count
- patent-infringing replacement algorithm (IBM contacted them and it had to be pulled from a postgres release)
- atomic buffer algorithm scales well up to 500 clients
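The clock-sweep idea mentioned above can be sketched in a few lines: each buffer slot carries a usage count, and a "clock hand" sweeps the slots, decrementing counts until it finds one at zero to evict. A simplified toy sketch (real Postgres has pinning, a free list, and concurrency on top of this):

```python
class ClockSweepPool:
    """Toy clock-sweep buffer pool: usage count per slot, sweeping hand."""
    def __init__(self, nslots):
        self.pages = [None] * nslots   # which page each slot holds
        self.usage = [0] * nslots      # usage count per slot
        self.hand = 0

    def _evict(self):
        # Sweep: decrement non-zero counts; evict the first zero-count slot.
        while True:
            if self.usage[self.hand] == 0:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.pages)
                return victim
            self.usage[self.hand] -= 1
            self.hand = (self.hand + 1) % len(self.pages)

    def access(self, page):
        if page in self.pages:         # cache hit: bump the usage count
            slot = self.pages.index(page)
            self.usage[slot] = min(self.usage[slot] + 1, 5)  # capped, as in Postgres
            return "hit"
        slot = self._evict()           # cache miss: sweep for a victim slot
        self.pages[slot] = page
        self.usage[slot] = 1
        return "miss"

pool = ClockSweepPool(3)
print([pool.access(p) for p in ["A", "B", "A", "C", "D", "A"]])
# ['miss', 'miss', 'hit', 'miss', 'miss', 'hit'] -- "A" survives eviction
# because its usage count was bumped by the earlier hit
```

This also makes the scalability complaint concrete: the sweep touches shared usage counts across many buffers, which is expensive to coordinate across many concurrent backends.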

not-yet-fixed issues
- extension locks … problematic for bulk-write workloads
- buffer algorithm … so far only made a bad algorithm better
- transaction isolation … maxes out at about 300 active concurrent sessions

q&a
- introducing pgbouncer into postgres (core)
- pgbouncer is a connection pooling utility that can be plugged on top of a PostgreSQL server. It can be used to limit the maximum number of connections on server side by managing a pool of idle connections that can be used by any applications. (http://michael.otacoo.com/postgresql-2/first-steps-with-pgbouncer-how-to-set-and-run-it/)
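The pooling idea pgbouncer implements — cap server-side connections and recycle idle ones — can be sketched as a toy Python class (fake connections, no threads; pgbouncer itself is a separate process sitting in front of the server):

```python
import queue

class ConnectionPool:
    """Toy connection pool: hand out idle connections, cap total opens."""
    def __init__(self, connect, max_size):
        self._connect = connect          # factory that opens a new connection
        self._idle = queue.Queue()       # connections waiting to be reused
        self._open = 0
        self._max = max_size

    def acquire(self):
        try:
            return self._idle.get_nowait()   # reuse an idle connection
        except queue.Empty:
            if self._open < self._max:
                self._open += 1
                return self._connect()       # open a new one, under the cap
            return self._idle.get()          # cap reached: wait for a release

    def release(self, conn):
        self._idle.put(conn)                 # back to the idle set, stays open

# Usage with a fake connect() that records how many real connections were made.
made = []
pool = ConnectionPool(connect=lambda: made.append("conn") or len(made), max_size=2)
c1 = pool.acquire()
pool.release(c1)
c2 = pool.acquire()          # reuses c1 rather than opening a second connection
print(len(made), c1 == c2)   # 1 True
```

The server only ever sees `max_size` connections no matter how many clients cycle through the pool, which is exactly the pressure-limiting behaviour the Q&A was about.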


Scaling up … sharding
- Citus Data … pg_shard
- CloudFlare (CDN) … massive event logs
- PostgreSQL 9.3+
- pg_shard is good for nosql-type work patterns, i.e. access path by PK
- shards are regular postgres tables

create extension pg_shard;
create table customer_reviews (
select master_create_distributed_table

AWS example: 4 worker nodes … each with multiple table shards

psql -h master-node-1   << sharding/hashtext details  (SPOF? can have multi-master config)
\d shows the single (logical) table

psql -h worker-node-1  << actual data
\d shows the local shards for the table

select * from ps_distribution_metadata.shard << shard identifier
pg_shard using hash partitioning by default
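The routing behind hash partitioning can be sketched as: split the 32-bit hash space into contiguous ranges, one per shard, then hash the partition key and pick the shard whose range covers it. A toy Python sketch (pg_shard uses Postgres's hashtext(); crc32 here is just a stand-in hash):

```python
import zlib

def make_shards(shard_count, min_hash=-2**31, max_hash=2**31 - 1):
    """Split the signed 32-bit hash space into contiguous ranges, one per shard."""
    step = (max_hash - min_hash) // shard_count
    return [(min_hash + i * step,
             max_hash if i == shard_count - 1 else min_hash + (i + 1) * step - 1)
            for i in range(shard_count)]

def route(key, shards):
    """Hash the partition key and find the shard whose range covers it."""
    h = zlib.crc32(str(key).encode())
    h = h - 2**32 if h >= 2**31 else h   # fold into a signed 32-bit value
    for shard_id, (lo, hi) in enumerate(shards):
        if lo <= h <= hi:
            return shard_id
    raise ValueError("hash outside all shard ranges")

shards = make_shards(16)
# The same key always routes to the same shard, so single-PK lookups
# touch exactly one worker.
print(route("customer-42", shards) == route("customer-42", shards))  # True
```

This is why the PK access path is cheap (one shard) while anything that can't be hashed to a single range has to fan out, as the notes below on PK-range queries show.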

INSERT - shared lock on shard
UPDATE/DELETE - exclusive lock

select avg(rating) from reviews
- as much as possible push the compute down to the individual node/shard … which returns a local avg + a weight (row count)
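Pushing avg() down works because each shard can compute a partial (sum, count) pair locally, and the coordinator merges the partials — a toy sketch with made-up ratings:

```python
def shard_partial(ratings):
    """What each worker computes locally: a (sum, count) partial aggregate."""
    return (sum(ratings), len(ratings))

def combine_avg(partials):
    """What the coordinator does: merge partials into the global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

shard1 = [4, 5, 3]   # ratings stored on worker 1 (hypothetical data)
shard2 = [2, 4]      # ratings stored on worker 2

partials = [shard_partial(shard1), shard_partial(shard2)]
print(combine_avg(partials))                        # 3.6
print(sum(shard1 + shard2) / len(shard1 + shard2))  # 3.6 -- same as unsharded
```

Note that shipping each shard's *average* alone would be wrong (shards hold different row counts); the count is the weight that makes the merge exact.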

\timing … inserts taking 2.5ms not 0.1ms
selecting on a range of PKs
- can be expensive (queries all shards)
- you can use range partitions (typically by time)
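Range partitioning helps because the planner can prune: a range query only has to touch the shards whose stored range overlaps the query's range, instead of all of them. A toy sketch with hypothetical time-keyed shards:

```python
def shards_for_range(shard_ranges, lo, hi):
    """Return only the shards whose [start, end] range overlaps [lo, hi]."""
    return [sid for sid, (start, end) in shard_ranges.items()
            if start <= hi and lo <= end]

# Hypothetical shards, each covering a contiguous slice of the key (e.g. time)
shard_ranges = {101: (0, 9), 102: (10, 19), 103: (20, 29), 104: (30, 39)}

print(shards_for_range(shard_ranges, 12, 25))  # [102, 103] -- two shards queried
print(shards_for_range(shard_ranges, 0, 5))    # [101] -- one shard, others pruned
```

Under hash partitioning the same range query would hash to no single range and hence fan out to every shard, which is the expense the note above describes.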

No explain for pg_shard queries
small number of customers using pg_shard

