Tips to increase PostgreSQL performance
1. Avoid joins where possible; use a subquery instead of a join
Note : subqueries can be slower than LEFT [OUTER] JOINs, but in my opinion their strength is slightly higher readability.
select GPA from Student where sID in (select sID from Apply where major = 'CS');
Don't write it like the following:
select GPA from Student, Apply where Student.sID = Apply.sID and Apply.major = 'CS';
Another example:
SELECT id, alias
FROM mos_categories
WHERE id IN (
    SELECT DISTINCT catid
    FROM mos_content
);
~ 5 seconds
SELECT DISTINCT mos_content.catid, mos_categories.alias
FROM mos_content, mos_categories
WHERE mos_content.catid = mos_categories.id;
~ 15 seconds
2. If you need to join tables, join the small table to the big table; before that, check the table sizes using the following query.
SELECT pg_size_pretty(pg_relation_size('schema_name.tablename'));
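A minimal sketch, using the line_items and orders tables from the examples below (the public schema and the join column are assumptions):
SELECT pg_size_pretty(pg_relation_size('public.orders'));      -- e.g. the smaller table
SELECT pg_size_pretty(pg_relation_size('public.line_items'));  -- e.g. the bigger table
-- write the join from the smaller table to the bigger one
SELECT o.id, SUM(li.price)
FROM orders o
JOIN line_items li ON li.order_id = o.id
GROUP BY o.id;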
3. PostgreSQL multicolumn index column order should be chosen as follows:
Let’s review the explain analyze plan of the following simple query without indexes:
EXPLAIN ANALYZE SELECT line_items.product_id, SUM(line_items.price)
FROM line_items
WHERE product_id > 80
GROUP BY 1
EXPLAIN PLAN :
HashAggregate (cost=13.81..14.52 rows=71 width=12) (actual time=0.137..0.141 rows=20 loops=1)
Group Key: product_id
-> Seq Scan on line_items (cost=0.00..13.25 rows=112 width=8) (actual time=0.017..0.082 rows=112 loops=1)
Filter: (product_id > 80)
Rows Removed by Filter: 388
Planning time: 0.082 ms
Execution time: 0.187 ms
AFTER CREATING INDEX :
CREATE INDEX items_product_id_price ON line_items(product_id, price) ;
EXPLAIN PLAN :
GroupAggregate (cost=0.27..7.50 rows=71 width=12) (actual time=0.034..0.090 rows=20 loops=1)
Group Key: product_id
-> Index Only Scan using items_product_id_price on line_items (cost=0.27..6.23 rows=112 width=8) (actual time=0.024..0.049 rows=112 loops=1)
Index Cond: (product_id > 80)
Heap Fetches: 0
Planning time: 0.271 ms
Execution time: 0.136 ms
CREATING AN INDEX WITH REVERSED COLUMN ORDER:
CREATE INDEX items_product_id_price_reversed ON line_items(price, product_id);
This is equivalent to a full-table sequential scan, so don't create the index in this order.
ANOTHER EXAMPLE:
SELECT orders.product_id, SUM(line_items.price)
FROM line_items
LEFT JOIN orders ON line_items.order_id = orders.id
WHERE line_items.created_at BETWEEN '2018-01-01' and '2018-01-02'
GROUP BY 1
Here we have a join on order_id and a filter on created_at. Thus, we can create a multicolumn index that contains created_at in the first position, order_id in the second and price in the third:
CREATE INDEX line_items_created_at_order_id_price ON line_items(created_at, order_id, price);
We’ll get the following explain plan:
GroupAggregate (cost=12.62..12.64 rows=1 width=12) (actual time=0.029..0.029 rows=1 loops=1)
Group Key: orders.product_id
-> Sort (cost=12.62..12.62 rows=1 width=8) (actual time=0.025..0.026 rows=1 loops=1)
Sort Key: orders.product_id
Sort Method: quicksort Memory: 25kB
-> Nested Loop Left Join (cost=0.56..12.61 rows=1 width=8) (actual time=0.015..0.017 rows=1 loops=1)
-> Index Only Scan using line_items_created_at_order_id_price on line_items (cost=0.27..4.29 rows=1 width=8) (actual time=0.009..0.010 rows=1 loops=1)
Index Cond: ((created_at >= '2018-01-01 00:00:00'::timestamp without time zone) AND (created_at <= '2018-01-02 00:00:00'::timestamp without time zone))
Heap Fetches: 0
-> Index Scan using orders_pkey on orders (cost=0.29..8.30 rows=1 width=8) (actual time=0.004..0.005 rows=1 loops=1)
Index Cond: (line_items.order_id = id)
Planning time: 0.303 ms
Execution time: 0.072 ms
As you can see, line_items_created_at_order_id_price is used to narrow the scan by the date condition. After that, the result is joined with orders using an index scan on orders_pkey.
Note : a GIN index typically gives faster SELECTs, a GiST index gives faster UPDATEs, and B-tree is the default index type.
First create the default (B-tree) index and note the execution time; after that try GIN and GiST indexes and note their execution times,
then keep only the index that gives the best performance.
To capture the execution plan, please refer to https://www.tutorialdba.com/p/postgresql-explain-plan.html
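A minimal sketch of that comparison, assuming a hypothetical table docs with a text column title, and the pg_trgm extension for the GIN/GiST operator classes:
-- requires the pg_trgm extension for trigram operator classes (assumption)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- 1. default B-tree index
CREATE INDEX docs_title_btree ON docs (title);
EXPLAIN ANALYZE SELECT id FROM docs WHERE title LIKE 'postgres%';   -- note Execution time
DROP INDEX docs_title_btree;
-- 2. GIN index
CREATE INDEX docs_title_gin ON docs USING gin (title gin_trgm_ops);
EXPLAIN ANALYZE SELECT id FROM docs WHERE title LIKE 'postgres%';   -- note Execution time
DROP INDEX docs_title_gin;
-- 3. GiST index
CREATE INDEX docs_title_gist ON docs USING gist (title gist_trgm_ops);
EXPLAIN ANALYZE SELECT id FROM docs WHERE title LIKE 'postgres%';   -- note Execution time
-- keep only the index whose plan shows the best execution time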
4. In the WHERE clause, use an IN list instead of chained OR conditions
with OR :
SELECT id FROM a
WHERE id = 42
OR id = 4711;
WITHOUT OR:
SELECT id FROM a
WHERE id IN (42, 4711);
5. INSTEAD OF USING MULTIPLE LIKEs, USE THE "LIKE ANY" OPTION
SELECT id FROM tutorialdba WHERE a_val LIKE 'something%' OR a_val LIKE 'other%';
--rewritten as follows:
SELECT id FROM tutorialdba WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);
6. Find and drop unused indexes:
Index all the things, right? Did you know having unused indexes can negatively affect write performance? The reason is that when you create an index, Postgres is burdened with the task of keeping it updated after write (INSERT / UPDATE / DELETE) operations. So adding an index is a balancing act: indexes can speed up reading of data (if created properly) but will slow down write operations. To find unused indexes you can run the following command.
SELECT
indexrelid::regclass as index,
relid::regclass as table,
'DROP INDEX ' || indexrelid::regclass || ';' as drop_statement
FROM
pg_stat_user_indexes
JOIN
pg_index USING (indexrelid)
WHERE
idx_scan = 0
AND indisunique is false;
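The generated drop_statement values can then be reviewed and run one by one; on a busy table, DROP INDEX CONCURRENTLY (available since PostgreSQL 9.2) avoids blocking writes while the index is removed, for example:
DROP INDEX CONCURRENTLY some_unused_index;  -- hypothetical index name; cannot be run inside a transaction block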
7. CHECKING INDEX STATUS :
--to check a table's index:
postgres=# select relname,indexrelname,indexrelid from pg_stat_all_indexes where indexrelid=24700;
relname | indexrelname | indexrelid
---------+--------------+------------
k | production | 24700
--check whether the index is valid and whether it is live
postgres=# select indexrelid,indislive,indisvalid from pg_index where indexrelid=24700;
indexrelid | indislive | indisvalid
------------+-----------+------------
24700 | t | t
(1 row)
MANUALLY REINDEX IF EITHER VALUE IS FALSE :
DROP INDEX registration.rg_patient_ix10;
CREATE INDEX rg_patient_ix10 ON registration.patient USING btree( lower(firstname), lower(middlename), lower(lastname) varchar_pattern_ops);
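Alternatively, the same index can be rebuilt in place instead of being dropped and recreated; note that a plain REINDEX locks out writes on the table while it runs:
REINDEX INDEX registration.rg_patient_ix10;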
8. Before and after changing a query, take the explain plan and note down the query run time from the plan.
Example:
Before changing the query:
EXPLAIN ANALYZE <query>; note the Execution time
After changing the query:
EXPLAIN ANALYZE <query>; note the Execution time
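For instance, using the two equivalent queries from tip 1 (an illustrative comparison; your timings will differ):
-- before the rewrite
EXPLAIN ANALYZE SELECT GPA FROM Student, Apply WHERE Student.sID = Apply.sID AND Apply.major = 'CS';
-- note the "Execution time" line in the output
-- after the rewrite
EXPLAIN ANALYZE SELECT GPA FROM Student WHERE sID IN (SELECT sID FROM Apply WHERE major = 'CS');
-- note the "Execution time" line again and compare the two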
9. CHECKING FOR DEAD TUPLES :
If you get a ticket regarding query tuning, you have to check whether the table has any dead tuples; if dead tuples exist, you have to run VACUUM.
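A minimal sketch for checking one table's dead tuples (the table name is a placeholder); the monitoring script in the next tip uses the same pg_stat_all_tables view across the whole database:
-- dead tuple count and last vacuum times for a single table
SELECT schemaname, relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_all_tables
WHERE relname = 'your_table_name';
-- if n_dead_tup is high, vacuum and analyze the table
VACUUM ANALYZE your_schema.your_table_name;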
10. CHECKING LONG-RUNNING QUERIES :
THIS SCRIPT IS HELPFUL FOR CHECKING LONG-RUNNING QUERIES AS WELL AS DISK USAGE, RAM USAGE, DEAD TUPLES, ETC.
[root@tutorialdba.com]# cat TUTORIALDBAmonitor.sh
echo " "
echo " "
HOSTNAME="101.0.1.123"
PSQL="/opt/edb/as9.6/bin/psql"
PORT=5444
DB="edb"
USER="enterprisedb"
export PGPASSWORD="rayafeel.com"
echo "ENTER THE TIME IN MINUTES ?"
echo "For example if you give 10 means it will display "
echo "queries which is running more than 10 minutes"
echo "0 means will display all queries:"
echo ""
read time
echo " "
echo " "
echo "WHAT ARE THE QUERY IS RUNING MORE THAN $time MINUTES"
$PSQL -d $DB -U $USER -p $PORT -h $HOSTNAME <<EOF
\pset format wrapped
SELECT pid, now() - query_start as "runtime", usename, datname, state, query
FROM pg_stat_activity
WHERE now() - query_start > '$time minutes'::interval
ORDER BY runtime DESC;
EOF
echo "*********************COMPLETED 1****************************** "
echo " "
echo " "
echo " "
echo "CHECKING dISK SPACE:"
df -h
echo "*********************COMPLETED 2****************************** "
echo " "
echo " "
echo " "
echo "CHECKING RAM USAGE:"
free -h
echo "*********************COMPLETED 3****************************** "
echo " "
echo " "
echo " "
echo "ENTER THE BLOATED COUNT ? "
echo "For example if you give 1000 means it will display the table count"
echo "how many table having more than 1000 dead tubles that database:"
echo ""
read count
echo " "
echo " "
echo "HOW MANY TABLES HAVING MORE THAN $count BLOATED TABLE :"
$PSQL -d $DB -U $USER -p $PORT -h $HOSTNAME <<EOF
\c LC
select count(*) from pg_stat_all_tables where n_dead_tup > $count;
EOF
echo " "
$PSQL -d $DB -U $USER -p $PORT -h $HOSTNAME <<EOF
\c DC
select count(*) from pg_stat_all_tables where n_dead_tup >$count;
EOF
echo "*********************COMPLETED 4****************************** "
echo " "
echo " "
echo " "
echo "ENTER THE BLOATED COUNT AGAING ? "
echo "it will give schema,table name and dead tuble count including "
echo "with vacuum & analyze script you can copy and run it to the "
echo "SQL prompt what are the table having more Bloated table "
echo ""
read count1
echo ""
echo ""
echo "WHAT ARE THE TABLE HAVING MORE THAN $count1 BLOATED TABLE : "
$PSQL -d $DB -U $USER -p $PORT -h $HOSTNAME <<EOF
\c LC
select 'vacuum ANALYZE '||schemaname||'.'||relname||';' from pg_stat_all_tables where n_dead_tup >$count1;
select schemaname,relname,n_dead_tup from pg_stat_all_tables where n_dead_tup >$count1;
EOF
echo " "
$PSQL -d $DB -U $USER -p $PORT -h $HOSTNAME <<EOF
\c DC
select 'vacuum ANALYZE '||schemaname||'.'||relname||';' from pg_stat_all_tables where n_dead_tup >$count1;
select schemaname,relname,n_dead_tup from pg_stat_all_tables where n_dead_tup >$count1;
EOF
echo "*********************COMPLETED 5****************************** "
echo " "
SAMPLE OUTPUT:
[root@tutorialdba.com]# sh TUTORIALDBAmonitor.sh
ENTER THE TIME IN MINUTES ?
For example, if you enter 10 it will display
queries that have been running for more than 10 minutes;
0 will display all queries:
4
QUERIES RUNNING FOR MORE THAN 4 MINUTES:
Output format is wrapped.
pid | runtime | usename | datname | state | query
-------+-----------------+----------+----------------+-------+------------
15832 | 00:05:57.796746 | hr | LC | idle | unlisten *
15465 | 00:05:35.23556 | ehis | LC | idle | unlisten *
15113 | 00:05:34.880537 | billing | LC | idle | unlisten *
15918 | 00:04:50.816381 | security | LC | idle | unlisten *
15930 | 00:04:48.148488 | billing | DC | idle | unlisten *
...
..
(10 rows)
*********************COMPLETED 1******************************
CHECKING DISK SPACE:
Filesystem Size Used Avail Use% Mounted on
/dev/xvda2 250G 2.0G 249G 1% /
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 8.0K 16G 1% /dev/shm
tmpfs 16G 177M 16G 2% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/xvdb 200G 33G 168G 17% /opt
tmpfs 3.2G 0 3.2G 0% /run/user/1000
*********************COMPLETED 2******************************
CHECKING RAM USAGE:
total used free shared buff/cache available
Mem: 31G 717M 27G 1.3G 2.7G 28G
Swap: 0B 0B 0B
*********************COMPLETED 3******************************
ENTER THE BLOATED COUNT ?
For example, if you enter 1000 it will display how many
tables have more than 1000 dead tuples in that database:
1500
HOW MANY TABLES HAVE MORE THAN 1500 DEAD TUPLES :
You are now connected to database "LC" as user "enterprisedb".
count
-------
3
(1 row)
You are now connected to database "DC" as user "enterprisedb".
count
-------
1
(1 row)
*********************COMPLETED 4******************************
ENTER THE BLOATED COUNT AGAIN ?
It will give the schema, table name and dead tuple count, along
with a VACUUM ANALYZE script that you can copy and run at the
SQL prompt for the tables that are bloated beyond this count
2000
TABLES HAVING MORE THAN 2000 DEAD TUPLES :
You are now connected to database "LC" as user "enterprisedb".
?column?
-----------------------------------------
vacuum ANALYZE pg_catalog.pg_attribute;
vacuum ANALYZE pg_catalog.pg_depend;
vacuum ANALYZE pg_catalog.pg_shdepend;
(3 rows)
schemaname | relname | n_dead_tup
------------+--------------+------------
pg_catalog | pg_attribute | 21218
pg_catalog | pg_depend | 9869
pg_catalog | pg_shdepend | 2730
(3 rows)
You are now connected to database "DC" as user "enterprisedb".
?column?
----------------------------------------
vacuum ANALYZE pg_catalog.pg_shdepend;
(1 row)
schemaname | relname | n_dead_tup
------------+-------------+------------
pg_catalog | pg_shdepend | 2730
(1 row)
*********************COMPLETED 5******************************
Vacuum and analyze the catalog tables that have the most dead tuples. Here pg_attribute has the most dead tuples, so that is the one I chose to VACUUM ANALYZE:
LC=# vacuum ANALYZE pg_catalog.pg_attribute;
VACUUM
- PostgreSQL Parameter Tuning :
- max_connections = <num>— This option sets the maximum number of database backends to have at any one time. Use this setting to ensure that you do not launch so many backends that you begin swapping to disk and kill the performance of all the children. Depending on your application it may be better to deny the connection entirely rather than degrade the performance of all of the other children.
max_connections = 1000 # with 1000 connections, total memory consumed can reach work_mem x 1000 = 16GB (assuming work_mem is 16MB)
- shared_buffers = <num>— Editing this option is the simplest way to improve the performance of your database server. The default is pretty low for most modern hardware. General wisdom says that this should be set to roughly 25% of the available RAM on the system. Like most of the options outlined here, you will simply need to try them at different levels (both up and down) and see how well it works on your particular system. Most people find that setting it larger than a third of RAM starts to degrade performance.
shared_buffers = 25% of System Memory
- effective_cache_size = <num>— This value tells PostgreSQL's optimizer how much memory PostgreSQL has available for caching data and helps it determine whether or not to use an index. A larger value increases the likelihood of using an index. This should be set to the amount of memory allocated to shared_buffers plus the amount of OS cache available. Often this is more than 50% of the total system memory.
effective_cache_size =50% of System Memory
- work_mem = <num>— This option controls the amount of memory used in sort operations and hash tables. While you may need to increase it if you do a ton of sorting in your application, care needs to be taken. This isn't a system-wide parameter but a per-operation one, so if a complex query has several sort operations it will use multiple work_mem units of memory, and multiple backends could be doing this at once. This can lead your database server to swap if the value is too large. This option was previously called sort_mem in older versions of PostgreSQL.
work_mem = 16MB (it depends on the backend operation; increase the size when executing slow-running or high-cost queries, as in the following)
set work_mem='1MB';  select ...;  -- running time is ~2000 ms
set work_mem='96MB'; select ...;  -- running time is ~1500 ms
Note : SET work_mem applies at the session level only, not the server level, so you can increase it while running big sorts.
- wal_buffers = <num>. Should be about 3% of shared_buffers. If your Postgres server runs more and more INSERT, UPDATE and DELETE statements, you should increase it to 8MB-16MB.
- max_fsm_pages = <num>— This option helps to control the free space map. When something is deleted from a table it isn't removed from the disk immediately; it is simply marked as "free" in the free space map. The space can then be reused for any new INSERTs that you do on the table. If your setup has a high rate of DELETEs and INSERTs it may be necessary to increase this value to avoid table bloat.
- fsync = <boolean>— This option determines whether all your WAL pages are fsync()'ed to disk before a transaction is committed. Having this on is safer, but can reduce write performance. If fsync is not enabled there is a chance of unrecoverable data corruption. Turn this off at your own risk.
- commit_delay = <num> and commit_siblings = <num>. Two parameters I don't recommend touching in most cases are commit_siblings and commit_delay, artifacts of an older attempt to improve performance on systems with slow commit times (which includes most systems that don't have a battery-backed write cache for accelerating that area). Nowadays turning off synchronous_commit is usually the better option.
default commit_delay is zero (no delay)
random_page_cost = <num> — random_page_cost controls the way PostgreSQL views non-sequential disk reads. A higher value makes it more likely that a sequential scan will be used over an index scan; a lower value tells the planner that your server has very fast disks, making index scans more attractive.
If this is still confusing to you, Rayafeel does offer a PostgreSQL Tuning Service
Note that many of these options consume shared memory and it will probably be necessary to increase the amount of shared memory allowed on your system to get the most out of these options.
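As a minimal illustration only (not a recommendation), here is how the rules of thumb above might translate into postgresql.conf for a hypothetical dedicated server with 16GB of RAM:
max_connections = 200        # keep as low as the application allows; each backend can use work_mem
shared_buffers = 4GB         # ~25% of system memory
effective_cache_size = 8GB   # ~50% of system memory (shared_buffers plus OS cache)
work_mem = 16MB              # per sort/hash operation, per backend
wal_buffers = 16MB           # raise for write-heavy workloads
random_page_cost = 1.1       # lower it only if the storage is fast (e.g. SSD)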
- PostgreSQL Connection Pooling :
There are several ways to accomplish database connection pooling, here is a short list of common ones:
- Pgpool is a small server that you run on the same machine as your clients and that pools database connections to some local or remote server (see the sketch after this list). The application simply points at the pgpool instance instead of the normal postmaster. From the application's perspective nothing has changed, as the connection pooling is hidden from it.
- In a mod_perl environment you can use Apache::DBI to handle database connection pooling inside of Apache itself.
- SQLRelay is another database connection manager that is somewhat database agnostic. It works with several databases other than PostgreSQL.
- You can always write a small bit of code to do this yourself, but I would highly recommend using an already developed solution to reduce the amount of debugging you have to do.
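For illustration, a minimal pgpool-II configuration sketch (the host, port and pool sizes are assumptions, not recommendations); the application then connects to pgpool's port instead of the PostgreSQL port:
# pgpool.conf (excerpt)
listen_addresses = '*'
port = 9999                      # applications connect here instead of 5432
backend_hostname0 = 'localhost'  # the real PostgreSQL server
backend_port0 = 5432
num_init_children = 32           # number of pooled child processes
max_pool = 4                     # cached connections per child
connection_cache = on
# example client connection through the pool:
#   psql -h 127.0.0.1 -p 9999 -U enterprisedb edb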
Tags: performance, postgresql