Shiv Iyer, Author at The WebScale Database Infrastructure Operations Experts

Committed to Building Optimal, Scalable, Highly Available, Fault-Tolerant, Reliable and Secured WebScale Database Infrastructure Operations

Linux Performance Troubleshooting with eBPF – MinervaDB Webinar 

MinervaDB is a full-stack performance optimization company with core expertise in infrastructure capacity planning / sizing, Operating Systems kernel tuning, database query optimizers, index optimization, SQL tuning and distributed systems. I am hosting a webinar (Monday, 27 July 2020 – 06:00 PM to 07:00 PM PDT) on eBPF (extended Berkeley Packet Filter) for Linux performance tracing and troubleshooting. In this webinar, I will talk about how you can use eBPF for Linux performance observability / monitoring and tracing (bcc and bpftrace), with practical examples of using eBPF tools in production engineering. You can register for this webinar here; it is a free event. Thanks for registering!
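To give a flavour of the tooling the webinar covers, here is a minimal sketch of two eBPF one-liners, assuming bpftrace and the bcc tools are installed (the bcc tool name carries a -bpfcc suffix on Debian/Ubuntu and differs on other distributions):

$ # Count openat() syscalls per process; press Ctrl-C to stop and print the summary
$ sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @opens[comm] = count(); }'

$ # Histogram of block I/O latency, sampled every second, five times
$ sudo biolatency-bpfcc 1 5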

MinervaDB Webinar on Data SRE – Building Database Infrastructure for Performance and Reliability

Building database infrastructure for performance and reliability? This is a definitive webinar for anyone who is interested in building Data SRE. Join Shiv Iyer, Founder and Principal of MinervaDB, on Friday, 5 June 2020 – 06:00 PM PDT to 06:45 PM PDT, as he discusses Data SRE best practices and tools. During this webinar he will discuss:

  • How Data SRE can impact your business.
  • The most expensive database outages and how you can address them proactively.
  • Data SRE tools and techniques.
  • Checklist for building database infrastructure addressing performance, scalability, high availability, reliability, fault-tolerance and security.
  • Data SRE future.
  • MinervaDB Data SRE best practices.


The post MinervaDB Webinar – Building MySQL Database Infrastructure for Performance and Reliability appeared first on The WebScale Database Infrastructure Operations Experts.

]]>
MinervaDB Webinar – Building MySQL Database Infrastructure for Performance and Reliability

Recently I did a webinar on “Building MySQL Database Infrastructure for Performance and Reliability”. It was a big success, and I thought I would share the slides of the webinar in this blog. I get a lot of emails daily from Database Architects, DBAs, Database Engineers, Technical Managers and Developers worldwide asking for best practices and a checklist to build MySQL for performance, scalability, high availability and database SRE, and the objective of this webinar is to share with them a cockpit view of MySQL infrastructure operations from the MinervaDB perspective. Database systems are growing faster than ever; the modern datanomy businesses like Facebook, Uber, Airbnb and LinkedIn are powered by database systems. This makes database infrastructure operationally complex, and we can’t technically scale such systems with eyeballs. Building MySQL operations for web-scale means delivering highly responsive, fault-tolerant and self-healing database infrastructure for the business. In this webinar we discuss the following topics:

  • Configuring MySQL for performance and reliability
  • Troubleshooting MySQL with Linux tools
  • Troubleshooting MySQL with the slow query log (see the example after this list)
  • Most common tools used in MinervaDB to troubleshoot MySQL performance
  • Monitoring MySQL performance
  • Building MySQL infrastructure operations for performance, scalability and reliability
  • MySQL Replication
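As a brief illustration of the slow query log topic above, the following is a minimal sketch for MySQL 5.7 / 8.0; the log file path is only an example, and the commands assume a user with privileges to change global variables:

$ # Enable the slow query log at runtime (persist the settings in my.cnf for production)
$ mysql -u root -p -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 1; SET GLOBAL slow_query_log_file = '/var/lib/mysql/slow.log';"

$ # Summarize the slowest query patterns captured so far
$ mysqldumpslow -s t -t 10 /var/lib/mysql/slow.log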

You can download the PDF of the webinar here.

MinervaDB Webinar – Building MySQL Infrastructure for Performance and Reliability

This is a webinar from Shiv Iyer (Founder and Principal of MinervaDB). He is a longtime (16 years) Database Infrastructure Operations Architect with core expertise in performance, scalability and Database SRE for open source database systems. If you are building MySQL for performance and reliability, you should attend this webinar. Shiv explains in detail how he and the MinervaDB team have built several planet-scale database infrastructure operations for high-profile internet properties / companies from diversified verticals like CDNs, mobile ad networks, social media applications, online commerce, social media gaming and FinTech. This webinar covers best practices and a checklist for MySQL performance, capacity planning / sizing, scalability, high availability, fault-tolerance, observability & resilience and database security. You can register for this 100% free and no-obligation webinar here.

Tuning Infrastructure for ClickHouse Performance  

When you are building a very large database system for analytics on ClickHouse, you have to carefully build and operate the infrastructure for performance and scalability. Is there any one magic wand to take care of full-stack performance? Unfortunately, the answer is no! If you are not proactively monitoring and sizing the database infrastructure, you may experience severe performance bottlenecks or sometimes a total database outage causing serious revenue impact, and all of this may happen during peak business hours or season. So where do we start planning the infrastructure for ClickHouse operations? As your ClickHouse database grows, the complexity of the queries also increases, so we strongly advocate investing in observability / monitoring infrastructure to troubleshoot more efficiently and proactively. We at MinervaDB use Grafana ( https://grafana.com/grafana/plugins/vertamedia-clickhouse-datasource ) to monitor ClickHouse operations and record every performance counter from CPU, network, memory / RAM and storage. This blog post is about knowing and monitoring the infrastructure components’ performance to build optimal ClickHouse operations.
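As a small sketch of that monitoring setup, assuming a standard Grafana installation with grafana-cli available, the ClickHouse datasource plugin mentioned above can be installed like this:

$ # Install the Vertamedia ClickHouse datasource plugin and restart Grafana to load it
$ sudo grafana-cli plugins install vertamedia-clickhouse-datasource
$ sudo systemctl restart grafana-server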

Monitor for overheating CPUs

Overheating can damage the processor and even the motherboard. Closely monitor the system if you are overclocking, and if the temperature exceeds 100 °C, turn the system off. Most modern processors reduce their clock speed when they get warm to try to cool themselves, which will cause a sudden degradation in performance.

Monitor your current CPU speed:

sudo cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq

You can also use turbostat to monitor the CPU load:

sudo ./turbostat --quiet --hide sysfs,IRQ,SMI,CoreTmp,PkgTmp,GFX%rc6,GFXMHz,PkgWatt,CorWatt,GFXWatt
	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	CPU%c1	CPU%c3	CPU%c6	CPU%c7
	-	-	488	90.71	3900	3498	12.50	0.00	0.00	74.98
	0	0	5	0.13	3900	3498	99.87	0.00	0.00	0.00
	0	4	3897	99.99	3900	3498	0.01
	1	1	0	0.00	3856	3498	0.01	0.00	0.00	99.98
	1	5	0	99.00	3861	3498	0.01
	2	2	1	0.02	3889	3498	0.03	0.00	0.00	99.95
	2	6	0	87.81	3863	3498	0.05
	3	3	0	0.01	3869	3498	0.02	0.00	0.00	99.97
	3	7	0	0.00	3878	3498	0.03
	Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	CPU%c1	CPU%c3	CPU%c6	CPU%c7
	-	-	491	82.79	3900	3498	12.42	0.00	0.00	74.99
	0	0	27	0.69	3900	3498	99.31	0.00	0.00	0.00
	0	4	3898	99.99	3900	3498	0.01
	1	1	0	0.00	3883	3498	0.01	0.00	0.00	99.99
	1	5	0	0.00	3898	3498	56.61
	2	2	0	0.01	3889	3498	0.02	0.00	0.00	99.98
	2	6	0	0.00	3889	3498	0.02
	3	3	0	0.00	3856	3498	0.01	0.00	0.00	99.99
	3	7	0	0.00	3897	3498	0.01

Using PSENSOR to monitor hardware temperature

psensor is a graphical hardware temperature monitor for Linux.

It can monitor:

  • the temperature of the motherboard and CPU sensors (using lm-sensors).
  • the temperature of the NVidia GPUs (using XNVCtrl).
  • the temperature of ATI/AMD GPUs (not enabled in official distribution repositories, see the instructions for enabling its support).
  • the temperature of the Hard Disk Drives (using hddtemp or libatasmart).
  • the rotation speed of the fans (using lm-sensors).
  • the CPU usage (since 0.6.2.10 and using Gtop2).
PSENSOR screenshot (source – https://wpitchoune.net/psensor/ )

Since the Intel CPU thermal limit is 100 °C, we can quantify the amount of overheating by measuring the amount of time the CPU temperature was running at > 99 °C.
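A minimal sketch of that measurement, assuming the lm-sensors package is installed and that your CPU exposes a "Package id 0" reading (the sensor label and thresholds differ per machine):

$ sudo apt install lm-sensors && sudo sensors-detect --auto
$ # Log a timestamp for every second the package temperature is at or above 99 °C
$ while true; do sensors | grep 'Package id 0:' | grep -Eq '\+(99|1[0-9][0-9])\.' && date '+%F %T' >> overheat.log; sleep 1; done
$ wc -l overheat.log   # roughly the number of seconds spent at the thermal limit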

CPU frequency based on cooling (chart source: TechSpot / Intel)

Choosing RAID for Performance

The following summarizes different RAID levels and how they impact performance:

RAID level 0 – Striping
  • Advantages: RAID 0 offers great performance in both read and write operations; there is no overhead caused by parity controls, and all storage capacity is used.
  • Disadvantages: RAID 0 is not fault-tolerant. If one drive fails, all data in the RAID 0 array is lost. It should not be used for mission-critical systems.

RAID level 1 – Mirroring
  • Advantages: RAID 1 offers excellent read speed and a write speed comparable to that of a single drive. If a drive fails, data does not have to be rebuilt; it just has to be copied to the replacement drive.
  • Disadvantages: Software RAID 1 solutions do not always allow a hot swap of a failed drive, meaning the failed drive can only be replaced after powering down the computer it is attached to. For servers used simultaneously by many people this may not be acceptable; such systems typically use hardware controllers that do support hot swapping. The main disadvantage is that the effective storage capacity is only half of the total drive capacity, because all data gets written twice.

RAID level 5
  • Advantages: Read transactions are very fast, while write transactions are somewhat slower (due to the parity that has to be calculated). If a drive fails, you still have access to all data, even while the failed drive is being replaced and the storage controller rebuilds the data on the new drive.
  • Disadvantages: Technology complexity – if one of the disks in an array using 4 TB disks fails and is replaced, restoring the data (the rebuild time) may take a day or longer, depending on the load on the array and the speed of the controller. If another disk goes bad during that time, data is lost forever.

RAID level 6 – Striping with double parity
  • Advantages: Reads are very fast, and if two drives fail you still have access to all data, even while the failed drives are being replaced, so RAID 6 is more secure than RAID 5.
  • Disadvantages: Technology complexity – rebuilding an array in which one drive failed can take a long time. Write transactions are slower than RAID 5 due to the additional parity data that has to be calculated; in one report the write performance was 20% lower.

RAID level 10 – combining RAID 1 & RAID 0
  • Advantages: High performance and fault tolerance – if something goes wrong, all we need to do is copy the data from the surviving mirror to a new drive.
  • Disadvantages: Highly expensive – half of the storage capacity goes directly to mirroring.
Storage configuration recommendations (see the scheduler check sketch below):

  • Use NCQ with a long queue depth.
  • Use the CFQ scheduler for HDDs.
    • Enable the write cache for improved write performance.
  • Use the noop scheduler for SSDs.
  • Ext4 is the most reliable filesystem choice.
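A quick sketch of checking and switching the I/O scheduler for a block device (the device name sda is an example; newer kernels offer mq-deadline / none instead of cfq / noop, so the available values depend on your kernel):

$ # Show the schedulers the kernel offers for this device; the active one is shown in brackets
$ cat /sys/block/sda/queue/scheduler

$ # Switch an SSD-backed device to noop (add a udev rule or kernel parameter to persist it)
$ echo noop | sudo tee /sys/block/sda/queue/scheduler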

You can read a detailed blog post about RAID here – https://minervadb.com/index.php/2019/08/04/raid-redundant-storage-for-database-reliability/

Huge Pages

What are Transparent Huge Pages and why do they exist? Operating systems, database systems and many other applications run in virtual memory. The operating system manages virtual memory using pages (contiguous blocks of memory). Technically, virtual memory is mapped to physical memory by the operating system, which maintains the page tables data structure in RAM. The address translation logic (page table walking) is implemented by the CPU’s memory management unit (MMU). The MMU also has a cache of recently used pages, called the Translation Lookaside Buffer (TLB).

Operating systems manage virtual memory using pages (contiguous blocks of memory). Typically, the size of a memory page is 4 KB: 1 GB of memory is roughly 262,000 pages, and 128 GB is over 33 million pages. Obviously the TLB cache can’t fit all of the pages, and performance suffers from cache misses. There are two main ways to improve this. The first is to increase the TLB size, which is expensive and won’t help significantly. The other is to increase the page size and therefore have fewer pages to map. Modern OSes and CPUs support large 2 MB and even 1 GB pages. Using large 2 MB pages, 128 GB of memory becomes just 65,536 pages.

Transparent Huge Page support in Linux exists for performance: it manages large pages automatically and transparently for applications. The benefits are pretty obvious: no changes are required on the application side, the number of TLB misses is reduced, and page table walking becomes cheaper. The feature can logically be divided into two parts: allocation and maintenance. THP allocation takes the regular ("higher-order") memory allocation path and requires that the OS be able to find a contiguous and aligned block of memory. It suffers from the same issue as regular pages, namely fragmentation. If the OS can’t find a contiguous block of memory, it will try to compact, reclaim or page out other pages. That process is expensive and can cause latency spikes (up to seconds). This issue was addressed in the 4.6 kernel (via the "defer" option): the OS falls back to a regular page if it can’t allocate a large one.

The second part is maintenance. Even if an application touches just 1 byte of memory, it will consume a whole 2 MB large page, which is obviously a waste of memory. So there is a background kernel thread called "khugepaged" that scans pages and tries to defragment and collapse them into one huge page. Although it is a background thread, it locks the pages it works with, so it can cause latency spikes too. Another pitfall lies in large page splitting: not all parts of the OS work with large pages (e.g. swap), so the OS splits large pages back into regular ones for them. This can also degrade performance and increase memory fragmentation.

Why do we recommend disabling Transparent Huge Pages for ClickHouse performance?

Transparent Huge Pages (THP) is a Linux memory management system that reduces the overhead of Translation Lookaside Buffer (TLB) lookups on machines with large amounts of memory by using larger memory pages.

However, in our experience ClickHouse often performs poorly with THP enabled, because it tends to have sparse rather than contiguous memory access patterns. When running ClickHouse on Linux, THP should be disabled for best performance.

$ echo 'never' | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

Yandex recommends perf top to monitor the time spent in the kernel on memory management.
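A hedged sketch of verifying the THP setting and watching for memory-management overhead follows; the kernel symbols mentioned in the comment are typical of huge-page compaction, but what you actually see depends on the kernel version:

$ # Confirm THP is disabled; the active mode is shown in brackets (always / madvise / never)
$ cat /sys/kernel/mm/transparent_hugepage/enabled

$ # Watch kernel time live; significant time in symbols such as compact_zone,
$ # or CPU consumed by the khugepaged thread, points at huge-page compaction overhead
$ sudo perf top -g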

RAM

ClickHouse performs great when you invest in high-quality RAM: as data volumes increase, caching benefits the frequently executed SORT / SEARCH intensive analytical queries even more. Yandex recommends that you do not disable memory overcommit; the value of /proc/sys/vm/overcommit_memory (check it with cat) should be 0 or 1. Run:

$ echo 0 | sudo tee /proc/sys/vm/overcommit_memory
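To make the setting survive a reboot, a minimal sketch for a distribution that reads drop-in files from /etc/sysctl.d (the file name below is only illustrative):

$ # Persist the overcommit setting and apply it immediately
$ echo 'vm.overcommit_memory = 0' | sudo tee /etc/sysctl.d/99-clickhouse.conf
$ sudo sysctl --system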

Conclusion

Visibility into ClickHouse operations is very important for building optimal, scalable and highly available data analytics platforms. Most often we measure the hardware infrastructure only when there is a performance bottleneck, and that reactive approach to troubleshooting performance is really expensive. When we work with our customers, we plan and conduct regular performance audits of their ClickHouse operations to right-size their infrastructure.

Benchmarking ClickHouse Performance – Amazon EC2 

Amazon EC2 infrastructure used for benchmarking ClickHouse:

  • Amazon EC2 instance details – R5DN Eight Extra Large / r5dn.8xlarge
  • RAM – 256.0 GiB
  • vCPUs – 32
  • Storage – 1200 GiB (2 * 600 GiB NVMe SSD)
  • Network performance – 25 Gigabit

Benchmarking Dataset  – New York Taxi data

Data sources:

https://github.com/toddwschneider/nyc-taxi-data   

http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html 

You can either import the raw data (from the sources above) or download prepared partitions.

Downloading the prepared partitions

$ curl -O https://clickhouse-datasets.s3.yandex.net/trips_mergetree/partitions/trips_mergetree.tar
$ tar xvf trips_mergetree.tar -C /var/lib/clickhouse # path to ClickHouse data directory
$ # check permissions of unpacked data, fix if required
$ sudo service clickhouse-server restart
$ clickhouse-client --query "select count(*) from datasets.trips_mergetree"

The entire download is an uncompressed set of CSV data files of about 227 GB in size; it takes approximately 50 minutes on a 1 Gbit connection. The data must be pre-processed in PostgreSQL before loading into ClickHouse:

$ time psql nyc-taxi-data -c "SELECT count(*) FROM trips;"
   count
------------
 1299989791
(1 row)

real 4m1.274s

Approx. 1.3 billion records in PostgreSQL, with a database size of around 390 GB.

Exporting data from PostgreSQL:

COPY
(
SELECT trips.id,
trips.vendor_id,
trips.pickup_datetime,
trips.dropoff_datetime,
trips.store_and_fwd_flag,
trips.rate_code_id,
trips.pickup_longitude,
trips.pickup_latitude,
trips.dropoff_longitude,
trips.dropoff_latitude,
trips.passenger_count,
trips.trip_distance,
trips.fare_amount,
trips.extra,
trips.mta_tax,
trips.tip_amount,
trips.tolls_amount,
trips.ehail_fee,
trips.improvement_surcharge,
trips.total_amount,
trips.payment_type,
trips.trip_type,
trips.pickup,
trips.dropoff,
cab_types.type cab_type,
weather.precipitation_tenths_of_mm rain,
weather.snow_depth_mm,
weather.snowfall_mm,
weather.max_temperature_tenths_degrees_celsius max_temp,
weather.min_temperature_tenths_degrees_celsius min_temp,
weather.average_wind_speed_tenths_of_meters_per_second wind,
pick_up.gid pickup_nyct2010_gid,
pick_up.ctlabel pickup_ctlabel,
pick_up.borocode pickup_borocode,
pick_up.boroname pickup_boroname,
pick_up.ct2010 pickup_ct2010,
pick_up.boroct2010 pickup_boroct2010,
pick_up.cdeligibil pickup_cdeligibil,
pick_up.ntacode pickup_ntacode,
pick_up.ntaname pickup_ntaname,
pick_up.puma pickup_puma,
drop_off.gid dropoff_nyct2010_gid,
drop_off.ctlabel dropoff_ctlabel,
drop_off.borocode dropoff_borocode,
drop_off.boroname dropoff_boroname,
drop_off.ct2010 dropoff_ct2010,
drop_off.boroct2010 dropoff_boroct2010,
drop_off.cdeligibil dropoff_cdeligibil,
drop_off.ntacode dropoff_ntacode,
drop_off.ntaname dropoff_ntaname,
drop_off.puma dropoff_puma
FROM trips
LEFT JOIN cab_types
ON trips.cab_type_id = cab_types.id
LEFT JOIN central_park_weather_observations_raw weather
ON weather.date = trips.pickup_datetime::date
LEFT JOIN nyct2010 pick_up
ON pick_up.gid = trips.pickup_nyct2010_gid
LEFT JOIN nyct2010 drop_off
ON drop_off.gid = trips.dropoff_nyct2010_gid
) TO '/opt/milovidov/nyc-taxi-data/trips.tsv';

The entire export completes in approximately 4 hours; the data snapshot speed was around 80 MB per second and the resulting TSV file is 590,612,904,969 bytes.

For data cleansing and removing NULLs we will create a temporary table in ClickHouse:

CREATE TABLE trips
(
trip_id UInt32,
vendor_id String,
pickup_datetime DateTime,
dropoff_datetime Nullable(DateTime),
store_and_fwd_flag Nullable(FixedString(1)),
rate_code_id Nullable(UInt8),
pickup_longitude Nullable(Float64),
pickup_latitude Nullable(Float64),
dropoff_longitude Nullable(Float64),
dropoff_latitude Nullable(Float64),
passenger_count Nullable(UInt8),
trip_distance Nullable(Float64),
fare_amount Nullable(Float32),
extra Nullable(Float32),
mta_tax Nullable(Float32),
tip_amount Nullable(Float32),
tolls_amount Nullable(Float32),
ehail_fee Nullable(Float32),
improvement_surcharge Nullable(Float32),
total_amount Nullable(Float32),
payment_type Nullable(String),
trip_type Nullable(UInt8),
pickup Nullable(String),
dropoff Nullable(String),
cab_type Nullable(String),
precipitation Nullable(UInt8),
snow_depth Nullable(UInt8),
snowfall Nullable(UInt8),
max_temperature Nullable(UInt8),
min_temperature Nullable(UInt8),
average_wind_speed Nullable(UInt8),
pickup_nyct2010_gid Nullable(UInt8),
pickup_ctlabel Nullable(String),
pickup_borocode Nullable(UInt8),
pickup_boroname Nullable(String),
pickup_ct2010 Nullable(String),
pickup_boroct2010 Nullable(String),
pickup_cdeligibil Nullable(FixedString(1)),
pickup_ntacode Nullable(String),
pickup_ntaname Nullable(String),
pickup_puma Nullable(String),
dropoff_nyct2010_gid Nullable(UInt8),
dropoff_ctlabel Nullable(String),
dropoff_borocode Nullable(UInt8),
dropoff_boroname Nullable(String),
dropoff_ct2010 Nullable(String),
dropoff_boroct2010 Nullable(String),
dropoff_cdeligibil Nullable(String),
dropoff_ntacode Nullable(String),
dropoff_ntaname Nullable(String),
dropoff_puma Nullable(String)
) ENGINE = Log;
$ time clickhouse-client --query="INSERT INTO trips FORMAT TabSeparated" < trips.tsv

real 61m38.597s

I have done this benchmarking on a single ClickHouse server using the MergeTree engine. I created the summary table and loaded the data as shown below:

CREATE TABLE trips_mergetree
ENGINE = MergeTree(pickup_date, pickup_datetime, 8192)
AS SELECT

trip_id,
CAST(vendor_id AS Enum8('1' = 1, '2' = 2, 'CMT' = 3, 'VTS' = 4, 'DDS' = 5, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14)) AS vendor_id,
toDate(pickup_datetime) AS pickup_date,
ifNull(pickup_datetime, toDateTime(0)) AS pickup_datetime,
toDate(dropoff_datetime) AS dropoff_date,
ifNull(dropoff_datetime, toDateTime(0)) AS dropoff_datetime,
assumeNotNull(store_and_fwd_flag) IN ('Y', '1', '2') AS store_and_fwd_flag,
assumeNotNull(rate_code_id) AS rate_code_id,
assumeNotNull(pickup_longitude) AS pickup_longitude,
assumeNotNull(pickup_latitude) AS pickup_latitude,
assumeNotNull(dropoff_longitude) AS dropoff_longitude,
assumeNotNull(dropoff_latitude) AS dropoff_latitude,
assumeNotNull(passenger_count) AS passenger_count,
assumeNotNull(trip_distance) AS trip_distance,
assumeNotNull(fare_amount) AS fare_amount,
assumeNotNull(extra) AS extra,
assumeNotNull(mta_tax) AS mta_tax,
assumeNotNull(tip_amount) AS tip_amount,
assumeNotNull(tolls_amount) AS tolls_amount,
assumeNotNull(ehail_fee) AS ehail_fee,
assumeNotNull(improvement_surcharge) AS improvement_surcharge,
assumeNotNull(total_amount) AS total_amount,
CAST((assumeNotNull(payment_type) AS pt) IN ('CSH', 'CASH', 'Cash', 'CAS', 'Cas', '1') ? 'CSH' : (pt IN ('CRD', 'Credit', 'Cre', 'CRE', 'CREDIT', '2') ? 'CRE' : (pt IN ('NOC', 'No Charge', 'No', '3') ? 'NOC' : (pt IN ('DIS', 'Dispute', 'Dis', '4') ? 'DIS' : 'UNK'))) AS Enum8('CSH' = 1, 'CRE' = 2, 'UNK' = 0, 'NOC' = 3, 'DIS' = 4)) AS payment_type_,
assumeNotNull(trip_type) AS trip_type,
ifNull(toFixedString(unhex(pickup), 25), toFixedString('', 25)) AS pickup,
ifNull(toFixedString(unhex(dropoff), 25), toFixedString('', 25)) AS dropoff,
CAST(assumeNotNull(cab_type) AS Enum8('yellow' = 1, 'green' = 2, 'uber' = 3)) AS cab_type,

assumeNotNull(pickup_nyct2010_gid) AS pickup_nyct2010_gid,
toFloat32(ifNull(pickup_ctlabel, '0')) AS pickup_ctlabel,
assumeNotNull(pickup_borocode) AS pickup_borocode,
CAST(assumeNotNull(pickup_boroname) AS Enum8('Manhattan' = 1, 'Queens' = 4, 'Brooklyn' = 3, '' = 0, 'Bronx' = 2, 'Staten Island' = 5)) AS pickup_boroname,
toFixedString(ifNull(pickup_ct2010, '000000'), 6) AS pickup_ct2010,
toFixedString(ifNull(pickup_boroct2010, '0000000'), 7) AS pickup_boroct2010,
CAST(assumeNotNull(ifNull(pickup_cdeligibil, ' ')) AS Enum8(' ' = 0, 'E' = 1, 'I' = 2)) AS pickup_cdeligibil,
toFixedString(ifNull(pickup_ntacode, '0000'), 4) AS pickup_ntacode,

CAST(assumeNotNull(pickup_ntaname) AS Enum16('' = 0, 'Airport' = 1, 'Allerton-Pelham Gardens' = 2, 'Annadale-Huguenot-Prince\'s Bay-Eltingville' = 3, 'Arden Heights' = 4, 'Astoria' = 5, 'Auburndale' = 6, 'Baisley Park' = 7, 'Bath Beach' = 8, 'Battery Park City-Lower Manhattan' = 9, 'Bay Ridge' = 10, 'Bayside-Bayside Hills' = 11, 'Bedford' = 12, 'Bedford Park-Fordham North' = 13, 'Bellerose' = 14, 'Belmont' = 15, 'Bensonhurst East' = 16, 'Bensonhurst West' = 17, 'Borough Park' = 18, 'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel' = 19, 'Briarwood-Jamaica Hills' = 20, 'Brighton Beach' = 21, 'Bronxdale' = 22, 'Brooklyn Heights-Cobble Hill' = 23, 'Brownsville' = 24, 'Bushwick North' = 25, 'Bushwick South' = 26, 'Cambria Heights' = 27, 'Canarsie' = 28, 'Carroll Gardens-Columbia Street-Red Hook' = 29, 'Central Harlem North-Polo Grounds' = 30, 'Central Harlem South' = 31, 'Charleston-Richmond Valley-Tottenville' = 32, 'Chinatown' = 33, 'Claremont-Bathgate' = 34, 'Clinton' = 35, 'Clinton Hill' = 36, 'Co-op City' = 37, 'College Point' = 38, 'Corona' = 39, 'Crotona Park East' = 40, 'Crown Heights North' = 41, 'Crown Heights South' = 42, 'Cypress Hills-City Line' = 43, 'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill' = 44, 'Douglas Manor-Douglaston-Little Neck' = 45, 'Dyker Heights' = 46, 'East Concourse-Concourse Village' = 47, 'East Elmhurst' = 48, 'East Flatbush-Farragut' = 49, 'East Flushing' = 50, 'East Harlem North' = 51, 'East Harlem South' = 52, 'East New York' = 53, 'East New York (Pennsylvania Ave)' = 54, 'East Tremont' = 55, 'East Village' = 56, 'East Williamsburg' = 57, 'Eastchester-Edenwald-Baychester' = 58, 'Elmhurst' = 59, 'Elmhurst-Maspeth' = 60, 'Erasmus' = 61, 'Far Rockaway-Bayswater' = 62, 'Flatbush' = 63, 'Flatlands' = 64, 'Flushing' = 65, 'Fordham South' = 66, 'Forest Hills' = 67, 'Fort Greene' = 68, 'Fresh Meadows-Utopia' = 69, 'Ft. Totten-Bay Terrace-Clearview' = 70, 'Georgetown-Marine Park-Bergen Beach-Mill Basin' = 71, 'Glen Oaks-Floral Park-New Hyde Park' = 72, 'Glendale' = 73, 'Gramercy' = 74, 'Grasmere-Arrochar-Ft. 
Wadsworth' = 75, 'Gravesend' = 76, 'Great Kills' = 77, 'Greenpoint' = 78, 'Grymes Hill-Clifton-Fox Hills' = 79, 'Hamilton Heights' = 80, 'Hammels-Arverne-Edgemere' = 81, 'Highbridge' = 82, 'Hollis' = 83, 'Homecrest' = 84, 'Hudson Yards-Chelsea-Flatiron-Union Square' = 85, 'Hunters Point-Sunnyside-West Maspeth' = 86, 'Hunts Point' = 87, 'Jackson Heights' = 88, 'Jamaica' = 89, 'Jamaica Estates-Holliswood' = 90, 'Kensington-Ocean Parkway' = 91, 'Kew Gardens' = 92, 'Kew Gardens Hills' = 93, 'Kingsbridge Heights' = 94, 'Laurelton' = 95, 'Lenox Hill-Roosevelt Island' = 96, 'Lincoln Square' = 97, 'Lindenwood-Howard Beach' = 98, 'Longwood' = 99, 'Lower East Side' = 100, 'Madison' = 101, 'Manhattanville' = 102, 'Marble Hill-Inwood' = 103, 'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville' = 104, 'Maspeth' = 105, 'Melrose South-Mott Haven North' = 106, 'Middle Village' = 107, 'Midtown-Midtown South' = 108, 'Midwood' = 109, 'Morningside Heights' = 110, 'Morrisania-Melrose' = 111, 'Mott Haven-Port Morris' = 112, 'Mount Hope' = 113, 'Murray Hill' = 114, 'Murray Hill-Kips Bay' = 115, 'New Brighton-Silver Lake' = 116, 'New Dorp-Midland Beach' = 117, 'New Springville-Bloomfield-Travis' = 118, 'North Corona' = 119, 'North Riverdale-Fieldston-Riverdale' = 120, 'North Side-South Side' = 121, 'Norwood' = 122, 'Oakland Gardens' = 123, 'Oakwood-Oakwood Beach' = 124, 'Ocean Hill' = 125, 'Ocean Parkway South' = 126, 'Old Astoria' = 127, 'Old Town-Dongan Hills-South Beach' = 128, 'Ozone Park' = 129, 'Park Slope-Gowanus' = 130, 'Parkchester' = 131, 'Pelham Bay-Country Club-City Island' = 132, 'Pelham Parkway' = 133, 'Pomonok-Flushing Heights-Hillcrest' = 134, 'Port Richmond' = 135, 'Prospect Heights' = 136, 'Prospect Lefferts Gardens-Wingate' = 137, 'Queens Village' = 138, 'Queensboro Hill' = 139, 'Queensbridge-Ravenswood-Long Island City' = 140, 'Rego Park' = 141, 'Richmond Hill' = 142, 'Ridgewood' = 143, 'Rikers Island' = 144, 'Rosedale' = 145, 'Rossville-Woodrow' = 146, 'Rugby-Remsen Village' = 147, 'Schuylerville-Throgs Neck-Edgewater Park' = 148, 'Seagate-Coney Island' = 149, 'Sheepshead Bay-Gerritsen Beach-Manhattan Beach' = 150, 'SoHo-TriBeCa-Civic Center-Little Italy' = 151, 'Soundview-Bruckner' = 152, 'Soundview-Castle Hill-Clason Point-Harding Park' = 153, 'South Jamaica' = 154, 'South Ozone Park' = 155, 'Springfield Gardens North' = 156, 'Springfield Gardens South-Brookville' = 157, 'Spuyten Duyvil-Kingsbridge' = 158, 'St. Albans' = 159, 'Stapleton-Rosebank' = 160, 'Starrett City' = 161, 'Steinway' = 162, 'Stuyvesant Heights' = 163, 'Stuyvesant Town-Cooper Village' = 164, 'Sunset Park East' = 165, 'Sunset Park West' = 166, 'Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill' = 167, 'Turtle Bay-East Midtown' = 168, 'University Heights-Morris Heights' = 169, 'Upper East Side-Carnegie Hill' = 170, 'Upper West Side' = 171, 'Van Cortlandt Village' = 172, 'Van Nest-Morris Park-Westchester Square' = 173, 'Washington Heights North' = 174, 'Washington Heights South' = 175, 'West Brighton' = 176, 'West Concourse' = 177, 'West Farms-Bronx River' = 178, 'West New Brighton-New Brighton-St. 
George' = 179, 'West Village' = 180, 'Westchester-Unionport' = 181, 'Westerleigh' = 182, 'Whitestone' = 183, 'Williamsbridge-Olinville' = 184, 'Williamsburg' = 185, 'Windsor Terrace' = 186, 'Woodhaven' = 187, 'Woodlawn-Wakefield' = 188, 'Woodside' = 189, 'Yorkville' = 190, 'park-cemetery-etc-Bronx' = 191, 'park-cemetery-etc-Brooklyn' = 192, 'park-cemetery-etc-Manhattan' = 193, 'park-cemetery-etc-Queens' = 194, 'park-cemetery-etc-Staten Island' = 195)) AS pickup_ntaname,

toUInt16(ifNull(pickup_puma, '0')) AS pickup_puma,

assumeNotNull(dropoff_nyct2010_gid) AS dropoff_nyct2010_gid,
toFloat32(ifNull(dropoff_ctlabel, '0')) AS dropoff_ctlabel,
assumeNotNull(dropoff_borocode) AS dropoff_borocode,
CAST(assumeNotNull(dropoff_boroname) AS Enum8('Manhattan' = 1, 'Queens' = 4, 'Brooklyn' = 3, '' = 0, 'Bronx' = 2, 'Staten Island' = 5)) AS dropoff_boroname,
toFixedString(ifNull(dropoff_ct2010, '000000'), 6) AS dropoff_ct2010,
toFixedString(ifNull(dropoff_boroct2010, '0000000'), 7) AS dropoff_boroct2010,
CAST(assumeNotNull(ifNull(dropoff_cdeligibil, ' ')) AS Enum8(' ' = 0, 'E' = 1, 'I' = 2)) AS dropoff_cdeligibil,
toFixedString(ifNull(dropoff_ntacode, '0000'), 4) AS dropoff_ntacode,

CAST(assumeNotNull(dropoff_ntaname) AS Enum16('' = 0, 'Airport' = 1, 'Allerton-Pelham Gardens' = 2, 'Annadale-Huguenot-Prince\'s Bay-Eltingville' = 3, 'Arden Heights' = 4, 'Astoria' = 5, 'Auburndale' = 6, 'Baisley Park' = 7, 'Bath Beach' = 8, 'Battery Park City-Lower Manhattan' = 9, 'Bay Ridge' = 10, 'Bayside-Bayside Hills' = 11, 'Bedford' = 12, 'Bedford Park-Fordham North' = 13, 'Bellerose' = 14, 'Belmont' = 15, 'Bensonhurst East' = 16, 'Bensonhurst West' = 17, 'Borough Park' = 18, 'Breezy Point-Belle Harbor-Rockaway Park-Broad Channel' = 19, 'Briarwood-Jamaica Hills' = 20, 'Brighton Beach' = 21, 'Bronxdale' = 22, 'Brooklyn Heights-Cobble Hill' = 23, 'Brownsville' = 24, 'Bushwick North' = 25, 'Bushwick South' = 26, 'Cambria Heights' = 27, 'Canarsie' = 28, 'Carroll Gardens-Columbia Street-Red Hook' = 29, 'Central Harlem North-Polo Grounds' = 30, 'Central Harlem South' = 31, 'Charleston-Richmond Valley-Tottenville' = 32, 'Chinatown' = 33, 'Claremont-Bathgate' = 34, 'Clinton' = 35, 'Clinton Hill' = 36, 'Co-op City' = 37, 'College Point' = 38, 'Corona' = 39, 'Crotona Park East' = 40, 'Crown Heights North' = 41, 'Crown Heights South' = 42, 'Cypress Hills-City Line' = 43, 'DUMBO-Vinegar Hill-Downtown Brooklyn-Boerum Hill' = 44, 'Douglas Manor-Douglaston-Little Neck' = 45, 'Dyker Heights' = 46, 'East Concourse-Concourse Village' = 47, 'East Elmhurst' = 48, 'East Flatbush-Farragut' = 49, 'East Flushing' = 50, 'East Harlem North' = 51, 'East Harlem South' = 52, 'East New York' = 53, 'East New York (Pennsylvania Ave)' = 54, 'East Tremont' = 55, 'East Village' = 56, 'East Williamsburg' = 57, 'Eastchester-Edenwald-Baychester' = 58, 'Elmhurst' = 59, 'Elmhurst-Maspeth' = 60, 'Erasmus' = 61, 'Far Rockaway-Bayswater' = 62, 'Flatbush' = 63, 'Flatlands' = 64, 'Flushing' = 65, 'Fordham South' = 66, 'Forest Hills' = 67, 'Fort Greene' = 68, 'Fresh Meadows-Utopia' = 69, 'Ft. Totten-Bay Terrace-Clearview' = 70, 'Georgetown-Marine Park-Bergen Beach-Mill Basin' = 71, 'Glen Oaks-Floral Park-New Hyde Park' = 72, 'Glendale' = 73, 'Gramercy' = 74, 'Grasmere-Arrochar-Ft. 
Wadsworth' = 75, 'Gravesend' = 76, 'Great Kills' = 77, 'Greenpoint' = 78, 'Grymes Hill-Clifton-Fox Hills' = 79, 'Hamilton Heights' = 80, 'Hammels-Arverne-Edgemere' = 81, 'Highbridge' = 82, 'Hollis' = 83, 'Homecrest' = 84, 'Hudson Yards-Chelsea-Flatiron-Union Square' = 85, 'Hunters Point-Sunnyside-West Maspeth' = 86, 'Hunts Point' = 87, 'Jackson Heights' = 88, 'Jamaica' = 89, 'Jamaica Estates-Holliswood' = 90, 'Kensington-Ocean Parkway' = 91, 'Kew Gardens' = 92, 'Kew Gardens Hills' = 93, 'Kingsbridge Heights' = 94, 'Laurelton' = 95, 'Lenox Hill-Roosevelt Island' = 96, 'Lincoln Square' = 97, 'Lindenwood-Howard Beach' = 98, 'Longwood' = 99, 'Lower East Side' = 100, 'Madison' = 101, 'Manhattanville' = 102, 'Marble Hill-Inwood' = 103, 'Mariner\'s Harbor-Arlington-Port Ivory-Graniteville' = 104, 'Maspeth' = 105, 'Melrose South-Mott Haven North' = 106, 'Middle Village' = 107, 'Midtown-Midtown South' = 108, 'Midwood' = 109, 'Morningside Heights' = 110, 'Morrisania-Melrose' = 111, 'Mott Haven-Port Morris' = 112, 'Mount Hope' = 113, 'Murray Hill' = 114, 'Murray Hill-Kips Bay' = 115, 'New Brighton-Silver Lake' = 116, 'New Dorp-Midland Beach' = 117, 'New Springville-Bloomfield-Travis' = 118, 'North Corona' = 119, 'North Riverdale-Fieldston-Riverdale' = 120, 'North Side-South Side' = 121, 'Norwood' = 122, 'Oakland Gardens' = 123, 'Oakwood-Oakwood Beach' = 124, 'Ocean Hill' = 125, 'Ocean Parkway South' = 126, 'Old Astoria' = 127, 'Old Town-Dongan Hills-South Beach' = 128, 'Ozone Park' = 129, 'Park Slope-Gowanus' = 130, 'Parkchester' = 131, 'Pelham Bay-Country Club-City Island' = 132, 'Pelham Parkway' = 133, 'Pomonok-Flushing Heights-Hillcrest' = 134, 'Port Richmond' = 135, 'Prospect Heights' = 136, 'Prospect Lefferts Gardens-Wingate' = 137, 'Queens Village' = 138, 'Queensboro Hill' = 139, 'Queensbridge-Ravenswood-Long Island City' = 140, 'Rego Park' = 141, 'Richmond Hill' = 142, 'Ridgewood' = 143, 'Rikers Island' = 144, 'Rosedale' = 145, 'Rossville-Woodrow' = 146, 'Rugby-Remsen Village' = 147, 'Schuylerville-Throgs Neck-Edgewater Park' = 148, 'Seagate-Coney Island' = 149, 'Sheepshead Bay-Gerritsen Beach-Manhattan Beach' = 150, 'SoHo-TriBeCa-Civic Center-Little Italy' = 151, 'Soundview-Bruckner' = 152, 'Soundview-Castle Hill-Clason Point-Harding Park' = 153, 'South Jamaica' = 154, 'South Ozone Park' = 155, 'Springfield Gardens North' = 156, 'Springfield Gardens South-Brookville' = 157, 'Spuyten Duyvil-Kingsbridge' = 158, 'St. Albans' = 159, 'Stapleton-Rosebank' = 160, 'Starrett City' = 161, 'Steinway' = 162, 'Stuyvesant Heights' = 163, 'Stuyvesant Town-Cooper Village' = 164, 'Sunset Park East' = 165, 'Sunset Park West' = 166, 'Todt Hill-Emerson Hill-Heartland Village-Lighthouse Hill' = 167, 'Turtle Bay-East Midtown' = 168, 'University Heights-Morris Heights' = 169, 'Upper East Side-Carnegie Hill' = 170, 'Upper West Side' = 171, 'Van Cortlandt Village' = 172, 'Van Nest-Morris Park-Westchester Square' = 173, 'Washington Heights North' = 174, 'Washington Heights South' = 175, 'West Brighton' = 176, 'West Concourse' = 177, 'West Farms-Bronx River' = 178, 'West New Brighton-New Brighton-St. 
George' = 179, 'West Village' = 180, 'Westchester-Unionport' = 181, 'Westerleigh' = 182, 'Whitestone' = 183, 'Williamsbridge-Olinville' = 184, 'Williamsburg' = 185, 'Windsor Terrace' = 186, 'Woodhaven' = 187, 'Woodlawn-Wakefield' = 188, 'Woodside' = 189, 'Yorkville' = 190, 'park-cemetery-etc-Bronx' = 191, 'park-cemetery-etc-Brooklyn' = 192, 'park-cemetery-etc-Manhattan' = 193, 'park-cemetery-etc-Queens' = 194, 'park-cemetery-etc-Staten Island' = 195)) AS dropoff_ntaname,

toUInt16(ifNull(dropoff_puma, '0')) AS dropoff_puma

FROM trips

Query performance Benchmarking

Query 1:

SELECT cab_type, count(*) FROM trips_mergetree GROUP BY cab_type

0.163 seconds.

Query 2:

SELECT passenger_count, avg(total_amount) FROM trips_mergetree GROUP BY passenger_count

0.834 seconds.

Query 3:

SELECT passenger_count, toYear(pickup_date) AS year, count(*) FROM trips_mergetree GROUP BY passenger_count, year

1.813 seconds.

Query 4:

SELECT passenger_count, toYear(pickup_date) AS year, round(trip_distance) AS distance, count(*)
FROM trips_mergetree
GROUP BY passenger_count, year, distance
ORDER BY year, count(*) DESC

2.157 seconds.
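To reproduce timings like these, one simple approach is to run each query through clickhouse-client with its --time flag, which prints the elapsed time for the query; the table name matches the trips_mergetree table created above:

$ # Print elapsed time for each benchmark query (append FORMAT Null to skip result transfer if desired)
$ clickhouse-client --time --query "SELECT cab_type, count(*) FROM trips_mergetree GROUP BY cab_type"
$ clickhouse-client --time --query "SELECT passenger_count, avg(total_amount) FROM trips_mergetree GROUP BY passenger_count"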

MySQL to ClickHouse Replication 

MySQL works great for Online Transaction Processing (OLTP) systems, but MySQL performance degrades with analytical queries on very large database infrastructure. I agree you can optimize MySQL query performance with InnoDB compression, but why combine OLTP and OLAP (Online Analytical Processing) when you have columnar stores that can deliver high-performance analytical queries more efficiently? I have seen several companies build dedicated MySQL servers for analytics, but over a period of time they end up spending more money fine-tuning MySQL for analytics with no significant improvements. There is no point in blaming MySQL for what it is not built for; MySQL / MariaDB is a bad choice for columnar analytics / big data solutions. Columnar database systems are best suited for handling large quantities of data: data stored in columns typically is easier to compress, and it is also easier to access on a per-column basis. Typically you ask for data stored in a couple of columns, and the ability to retrieve just those columns, instead of reading all of the rows and filtering out unneeded data, makes data access faster. So how can you combine the best of both? Use MySQL purely for transaction processing systems and archive MySQL transactional data for analytics on a columnar store like ClickHouse. This post is about archiving and replicating data from MySQL to ClickHouse; you can continue reading from here.
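As one hedged illustration of this archive-and-analyze pattern, ClickHouse can pull rows directly out of MySQL with its mysql() table function, so an initial or periodic archive load can be a single statement; the host, database, table, credentials and date filter below are placeholders:

$ # One-off or periodic archive load from MySQL into a ClickHouse table
$ clickhouse-client --query "
    INSERT INTO analytics.orders_archive
    SELECT * FROM mysql('mysql-host:3306', 'shop', 'orders', 'ch_reader', 'secret')
    WHERE created_at < '2020-01-01'"

Continuous replication (rather than batch archiving) would need a binlog-based pipeline on top of a sketch like this.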


Optimal Indexing for Performance – How to plan Index Ops. ? 

An index, or database index, is a data structure which is used to quickly locate and access the data in a database table. Indexes are created on columns that will be the search key; the index contains a copy of the primary key or candidate key of the table. These values are stored in sorted order so that the corresponding data can be accessed quickly (note that the data itself may or may not be stored in sorted order). The index entries also act as data reference pointers, holding the address of the disk block where that particular key value can be found. Indexing in database systems is similar to what we see in books.

There are complex design trade-offs involving lookup performance, index size, and index-update performance. Many index designs exhibit logarithmic (O(log(N))) lookup performance, and in some applications it is possible to achieve flat (O(1)) performance. Indexes can be implemented using a variety of data structures; popular choices include balanced trees, B+ trees and hashes. The order in which the index definition lists the columns is important: it is possible to retrieve a set of row identifiers using only the first indexed column, but it is not possible or efficient (on most databases) to retrieve the set of row identifiers using only the second or later indexed columns.

At a very high level, there are only two kinds of indexes:

  1. Ordered indices: indices based on a sorted ordering of the values.
  2. Hash indices: indices based on the values being distributed uniformly across a range of buckets. The bucket to which a value is assigned is determined by a function called a hash function.

B+Tree indexing is a method of accessing and maintaining data. It should be used for large files that have unusual, unknown, or changing distributions because it reduces I/O processing when files are read. Also consider B+Tree indexing for files with long overflow chains. The prime block of the B+Tree index file (also called the root node) is pointed to by the header in the prime block of the B+Tree data file.

Indexing Attributes

Indexes are categorized on indexing attributes:

  • Primary Key Index

    Primary keys are unique and stored in sorted order, so search operations are quite efficient. The primary index is classified into two types: dense index and sparse index.

  • Dense Index

    • For every search key value in the data file, there is an index record. This makes searching faster but requires more space to store the index records themselves.
    • Index records contain search key value and a pointer to the actual record on the disk with that search key value.

  • Sparse Index

    • In a sparse index, an index record contains a search key and a pointer to the data on the disk. To search for a record, we first use the index record to reach the approximate location of the data.
    • We then start at the record pointed to by the index record and proceed along the pointers in the file (that is, sequentially) until we find the desired record.
    • If the data we are looking for is not at the location we reach directly by following the index, the system starts a sequential search until the desired data is found.
    • Dense indices are faster in general, but sparse indices require less space and impose less maintenance overhead for insertions and deletions, so they are better suited for very large-scale, high-volume SORT / SEARCH workloads.

  • Secondary Index

    • A secondary index may be built on a field which is a candidate key and has a unique value in every record, or on a non-key field with duplicate values. Secondary indexes are on a non-primary key, which allows you to model one-to-many relationships. A secondary index does not have any impact on how the rows are actually organized in data blocks; they can be in any order, and the only ordering is with respect to the index key in the index blocks. Because a secondary index has no control over the organization of the rows, there will be more I/O, and thus queries can be less efficient with a secondary index.

  • Clustered Index

    •  Clustered indexes sort and store the data rows in the table or view based on their key values. These are the columns included in the index definition. There can be only one clustered index per table, because the data rows themselves can be sorted in only one order. Clustered indexes are efficient on columns that are searched for a range of values. After the row with first value is found using a clustered index, rows with subsequent index values are guaranteed to be physically adjacent, thus providing faster access for a user query or an application. So the clustered index is about the way data is physically sorted on disk, which means it’s a good all-round choice for most situations.

  • Non-Clustered Index

    • Nonclustered indexes have a structure separate from the data rows. A non-clustered index contains the non-clustered index key values and each key value entry has a pointer to the data row that contains the key value. The pointer from an index row in a non-clustered index to a data row is called a row locator. The structure of the row locator depends on whether the data pages are stored in a heap or a clustered table. For a heap, a row locator is a pointer to the row. For a clustered table, the row locator is the clustered index key. You can add non-key columns to the leaf level of the non-clustered index to by-pass existing index key limits, and execute fully covered, indexed, queries. The non-clustered index is created to improve the performance of frequently used queries not covered by clustered index. It’s like a textbook, the index page is created separately at the beginning of that book. When a query is issued against a column on which the index is created, the database will first go to the index and look for the address of the corresponding row in the table. It will then go to that row address and fetch other column values. It is due to this additional step that non-clustered indexes are slower than clustered indexes.

Why is there no standardization for index creation and operations management ?

No standard defines how to create indexes, because the ISO SQL standard does not cover physical aspects. Indexes are one of the physical parts of database design, along with others like storage (tablespaces or filegroups). RDBMS vendors all provide a CREATE INDEX syntax with specific options that depend on their software’s capabilities.
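As a small example of that vendor-specific syntax in practice (shown here for MySQL, with made-up database, table and column names), a composite secondary index is created with CREATE INDEX, and EXPLAIN confirms whether the optimizer actually uses it:

$ mysql -u root -p mydb -e "
    CREATE INDEX idx_orders_customer_created ON orders (customer_id, created_at);
    EXPLAIN SELECT order_id, total FROM orders
     WHERE customer_id = 42 AND created_at >= '2019-01-01';"
$ # The EXPLAIN output should report key = idx_orders_customer_created for the query above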

Percona Live Open Source Database Conference 2019 – Austin, Texas takes place on 28-30 May 2019. This is no longer a MySQL-only conference; it is a must-attend conference for every open source database systems geek who is passionate about building web-scale database infrastructure operations that are optimal, scalable, highly reliable and secure. This conference provides a great opportunity to network, learn and share knowledge with passionate technology professionals from across the globe. I have personally tracked this conference for several years, and it is among those few very well-organized technology operations conferences. This year Percona Live is personally more special to me because I had an opportunity to be on the Percona Live conference committee ( https://www.percona.com/live/19/conference-committee ) and really enjoyed the entire process from kick-off call to closure. Thank you Alkin Tezuysal, Sveta Smirnova and Peter Zaitsev for having me on this team; I am looking forward to many more in the future. Thank you to the entire Percona team, the conference committee members, sponsors and attendees for making Percona Live a grand success every year. This would not be possible without support from all of you!

Why attend Percona Live ?

  • Percona Live provides an opportunity to network with peers, start-up founders and technology professionals. Mingle with all types of database community members: DBAs, developers, C-level executives and the latest database technology trend-setters.
  • Are you facing problems or do you need expert advice? Come to Percona Live; there are hundreds of experts you can talk with, and they are always passionate about sharing knowledge and helping you out.
  • Diversity – Percona seriously cares about your experience and satisfaction at this conference. If you still have concerns, please contact the Percona community team, Lorraine Pocklington and Tom Basil, in confidence via community-team@percona.com. They will explore with you the best way to accommodate your requests and whether they are able to meet your needs.

Vendor neutral and independent open source Database Systems Conference  – What are the topics discussed in Percona Live Open Source Database Conference 2019  – Austin, Texas ? 

This conference is about how you can benefit the most from investing in open source database systems projects, social innovation, automation and more. Attending a vendor-neutral open source database systems conference enables you to draw the big picture of building planet-scale database operations without missing any of the core components involved:

MySQL Topics

Do you have a MySQL business case to highlight, a skill to teach, or a big idea to share? Have you used MySQL to solve an application or business issue? How? Submit your speaking proposal for breakout, tutorial or lightning talk sessions. Put your ideas, case studies, best practices and technical knowledge in front of an engaged audience of open source peers.

  • MySQL 8.0 feature use cases and applications
  • What’s new and emerging in MySQL
  • Best design practices and performance optimization
  • Are you using MySQL in conjunction with other databases to tailor data for applications?
  • High availability, clustering
  • Container solutions like Docker and Kubernetes for MySQL environments
  • Distributed databases with replication and sharding
  • Monitoring, management, automation tools and best practices
  • New technology: AI, machine learning, blockchain databases
  • User stories and business cases
  • MySQL DBaaS and PaaS solutions (Amazon Web Services (AWS), Microsoft Azure, Google Cloud)
  • New MySQL trends: what is the next big thing?

MariaDB Topics

Do you have a MariaDB business case to highlight, a skill to teach, or a big idea to share? Have you used MariaDB to solve an application or business issue? How? Submit your speaking proposal for breakout, tutorial or lightning talk sessions. Put your ideas, case studies, best practices and technical knowledge in front of an engaged audience of open source peers.

  •  MariaDB 10.4 new features and how you can build optimal, scalable and highly reliable MariaDB applications
  •  MariaDB ColumnStore vs. ClickHouse
  •  Galera Cluster new features,  operations and troubleshooting
  •  Building secured MariaDB database infrastructure operations, best practices and tips & techniques
  •  MariaDB and MySQL – What Statistics Optimizer Needs Or When and How Not to Use Indexes
  •  Percona XtraBackup vs Mariabackup vs MySQL Enterprise Backup
  •  Tips and Tricks with MariaDB ColumnStore

MongoDB Topics

Do you have a MongoDB business case to highlight, a skill to teach, or a big idea to share? Have you used MongoDB to solve an application or business issue? How? Submit your speaking proposal for breakout, tutorial sessions or lightning talks. Put your ideas, case studies, best practices and technical knowledge in front of an engaged audience of open source peers.

  • MongoDB 4.0 feature use cases and applications
  • What’s new and upcoming in the MongoDB ecosystem
  • What excites you about MongoDB 4.0
  • Have you used MongoDB ACID transactions for production websites?
  • Are you using MongoDB in conjunction with other databases to tailor data for applications?
  • Best design practices and performance optimization
  • High availability, clustering
  • MongoDB Atlas and how you use it to get applications to market
  • MongoDB DBaaS and PaaS solutions (Amazon Web Services (AWS), Microsoft Azure, Google Cloud)
  • How MongoDB helps application development
  • Container solutions like Docker and Kubernetes for MongoDB environments
  • Distributed databases with replication and sharding

PostgreSQL Topics

Do you have a PostgreSQL business case to highlight, a skill to teach, or a big idea to share? Have you used PostgreSQL to solve an application or business issue? How? Submit your speaking proposal for breakout, tutorial or lightning talk sessions. Put your ideas, case studies, best practices and technical knowledge in front of an engaged audience of open source peers.

  • PostgreSQL feature use cases and applications
  • What’s new and upcoming in PostgreSQL
  • Are you using PostgreSQL in conjunction with other databases to tailor data for applications?
  • Are you using PostgreSQL in the cloud (AWS, Microsoft Azure, Google Cloud)?
  • Best design practices and performance optimization
  • High availability, clustering
  • Container solutions like Docker and Kubernetes for PostgreSQL environments
  • Distributed databases with replication and sharding
  • Monitoring, management, automation tools and best practices
  • New technology: AI, machine learning, blockchain databases
  • User stories and business cases
  • New PostgreSQL trends: what is the next big thing?

Open Source Database Topics

Do you have an open source business case to highlight, a skill to teach, or a big idea to share? Have you used open source database technology to solve an application or business issue? How? Submit your speaking proposal for breakout, tutorial sessions or lightning talks. Put your ideas, case studies, best practices and technical knowledge in front of an engaged audience of open source peers.

  • Open source database technologies overview and comparisons
  • Introduction to new open source technologies
  • Using open source database technology with MySQL, MariaDB, MongoDB, PostgreSQL or other technology
  • Migrating to open source databases
  • Technology practical use cases
  • How to implement container solutions like Docker and Kubernetes
  • Monitoring solutions

Open Source Databases and Business Goals

More and more enterprises are adopting open source databases to help achieve business goals and solve business issues, both on-premises and in the cloud. Percona Live 2019 includes a new business track that covers the best ideas for how open source databases and database technologies can address and solve business issues such as application time-to-market, resource costs, and operating and capital expenses (OPEX and CAPEX).

Do you have an open source business case to highlight, a skill to teach, or a big idea to share? Have you used open source database technology to solve a business issue? How? Submit your speaking proposal for breakout, tutorial or lightning talk sessions. Put your ideas, case studies and best practices in front of an engaged audience of open source peers.

Session topics could be one of (but not limited to):

  • How did you choose your current open source database?
  • On-premises or in the cloud, which is your preferred environment?
  • How have open source databases cut your OPEX or CAPEX costs?
  • What did moving to the cloud do for resource planning?
  • How does communication between developers and DBAs help launch applications more quickly?
  • Are you using more than one database to run applications? Why?
  • Using open source database technology with MySQL, MariaDB, MongoDB, PostgreSQL or other technology
  • Migrating to open source databases
  • How do you use database monitoring to make business decisions?

How could this event be successful without these sponsors?

Percona Live is not complete without the support of its sponsors. These sponsors provide a great opportunity for us to connect, network, learn and innovate:

Diamond sponsors 

  •  Continuent
  •  VividCortex

Platinum sponsors

  •  Veritas Technologies
  •  Amazon Web Services

Gold sponsors

  • EnterpriseDB

Silver sponsors

  • MySQL
  • Altinity
  • PingCap
  • SmartStyle
  • Facebook
  • ScaleGrid
  • Vendita
  • Intel
  • Galera Cluster
  • ProxySQL

Branding

  • Bloomberg Engineering
  • Yelp

Media sponsors

  • Austin Technology Council

How can I buy tickets to attend this conference?

https://www.percona.com/live/19/register

Do you have more questions?

To know more about this conference, tickets, sponsorships and general information, please contact info@percona.com or call +1-888-401-3401 

Tuning MyRocks for performance
https://minervadb.com/index.php/2018/11/02/tuning-myrocks-for-performance/ – Fri, 02 Nov 2018 18:46:19 +0000

There are basically two things I like most about using MyRocks: 1. the LSM advantage – a smaller space footprint and lower write amplification, and 2. the best of MySQL – replication, storage-engine-centric database infrastructure operations and the MySQL orchestration tools. Facebook built RocksDB as an embeddable, persistent key-value store with a lower amplification factor compared to InnoDB. Let me explain a scenario where InnoDB proves less efficient than RocksDB on SSD:
We know InnoDB is constrained by a fixed compressed page size. Alignment during fragmentation and compression causes extra unused space because the leaf nodes are not full. Consider an InnoDB table with a compressed page size of 8KB: a 16KB in-memory page that compresses to 5KB still uses 8KB on storage. On top of this, each entry in the primary key index carries 13 bytes of metadata (a 6-byte transaction id plus a 7-byte rollback pointer), and the metadata is not compressed, making the space overhead significant for small rows. Flash devices are typically limited by write endurance. In a typical B-tree, where index values are stored in leaf nodes and sorted by key, the working database often does not fit in memory and keys are updated in a random pattern, leading to higher write amplification. In the worst case, updating one row requires a number of page reads, makes several pages dirty, and forces many dirty pages to be written back to storage.

So what do I really love about MyRocks?

What really impresses me is RocksDB’s much lower write amplification factor compared to InnoDB. On pure flash, reducing write volume (write amplification) is important because flash wears out if too much data is written; reducing write volume also helps improve overall throughput on flash. InnoDB adopts an “update in place” architecture: even when updating just one record, the entire page the row belongs to becomes dirty, and the dirty page has to be written back to storage. On typical OLTP systems, the modification unit (row) size is much smaller than the I/O unit (page) size, which makes write amplification very high. I have published performance benchmarking of InnoDB, RocksDB and TokuDB; you can read about it here.

Things to remember before tuning MyRocks:

  • Data loading limitations
    • Limitation – Transaction must fit in memory:
      • mysql > ALTER TABLE post_master ENGINE = RocksDB;
        • Error 2013 (HY000): Lost connection to MySQL server during query.
      • Higher memory consumption; the server can eventually be killed by the OOM killer
    • When loading data into MyRocks tables, there are two recommended session variables:
      • SET session sql_log_bin=0;
      • SET session rocksdb_bulk_load=1;

There are a few interesting things to remember before bulk loading into MyRocks and using the system variable rocksdb_bulk_load (a short SQL sketch follows this list):

  1. Data being bulk loaded can never overlap with existing data in the table. It is always recommended to bulk load into an empty table. However, the mode does allow loading some data into the table, doing other operations, and then returning to bulk load additional data, as long as there is no overlap between what is being loaded and what already exists.
  2. The data may not be visible until bulk load mode is ended (i.e. rocksdb_bulk_load is set back to zero). RocksDB stores data in “SST” (Sorted String Table) files, and until a particular SST file has been added the data is not visible to the rest of the system; issuing a SELECT on the table currently being bulk loaded will therefore only show older data and will likely not show the most recently added rows. Ending bulk load mode causes the most recent SST file to be added. When bulk loading multiple tables, starting a new table will trigger the code to add the most recent SST file to the system; as a result, it is inadvisable to interleave INSERT statements to two or more tables during bulk load mode.
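
Putting those notes together, here is a minimal sketch of a bulk-load session, assuming the rows arrive as a CSV file (the file path is hypothetical; post_master is the table from the earlier example, and LOAD DATA is just one way to feed the rows):

    -- Minimal bulk-load sketch; the CSV path is hypothetical.
    -- The rows being loaded must not overlap rows already in the table.
    SET SESSION sql_log_bin = 0;       -- skip binary logging for the load
    SET SESSION rocksdb_bulk_load = 1; -- enter bulk-load mode

    LOAD DATA INFILE '/tmp/post_master.csv' INTO TABLE post_master;

    -- Ending bulk-load mode adds the final SST file and makes the new rows visible.
    SET SESSION rocksdb_bulk_load = 0;
    SET SESSION sql_log_bin = 1;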

Configuring MyRocks for performance:

Character Sets:

  • MyRocks works best with case-sensitive collations (latin1_bin, utf8_bin, binary); a quick example follows.
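
For illustration, here is a hypothetical table (the name and columns are invented) defined with a case-sensitive collation, which is the kind of schema the recommendation above points to:

    -- Hypothetical table using a case-sensitive (binary) collation.
    CREATE TABLE post_comments (
      id      BIGINT NOT NULL PRIMARY KEY,
      author  VARCHAR(64)  COLLATE utf8_bin,
      comment VARCHAR(255) COLLATE utf8_bin
    ) ENGINE=ROCKSDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;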

Transaction

  • The Read Committed isolation level is recommended. MyRocks’s transaction isolation implementation is different from InnoDB’s, but close to PostgreSQL’s; the default transaction isolation in PostgreSQL is Read Committed. A my.cnf line for this follows.
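
A one-line my.cnf sketch for this recommendation (the isolation level can also be set per session if you only want it for specific connections):

    # Sketch: default the server to the READ COMMITTED isolation level recommended for MyRocks.
    [mysqld]
    transaction-isolation = READ-COMMITTED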

Compression

  • Set kNoCompression (or kLZ4Compression) on L0-1 or L0-2
  • At the bottommost level, using a stronger compression algorithm (Zlib or ZSTD) is recommended.
  • If using zlib compression, set kZlibCompression at the bottommost level (bottommost_compression).
  • If using zlib compression, set the compression level accordingly. For example, compression_opts=-14:1:0 uses zlib compression level 1; if your application is not write intensive, compression_opts=-14:6:0 will give better space savings (zlib compression level 6).
  • For other levels, set kLZ4Compression. A my.cnf sketch of these compression settings follows this list.
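
Here is a my.cnf-style sketch of that compression layout. It is an assumption-laden example, not a drop-in configuration: it assumes the default seven LSM levels and a RocksDB build compiled with LZ4 and zlib, and in a real my.cnf this fragment would be merged into the single rocksdb_default_cf_options string your instance already uses:

    # Sketch only: no compression on the first two levels, LZ4 on the rest,
    # and zlib (level 1) on the bottommost level via bottommost_compression.
    [mysqld]
    rocksdb_default_cf_options=compression_per_level=kNoCompression:kNoCompression:kLZ4Compression:kLZ4Compression:kLZ4Compression:kLZ4Compression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0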

Data blocks, files and compactions

  • Set level_compaction_dynamic_level_bytes=true
  • Set proper rocksdb_block_size (default 4096). Larger block size will reduce space but increase CPU overhead because MyRocks has to uncompress many more bytes. There is a trade-off between space and CPU usage.
  • Set rocksdb_max_open_files=-1. If it is set greater than 0, RocksDB still uses table_cache, which locks a mutex every time you access a file. I think you’ll see much greater benefit with -1 because then you will not need to go through LRUCache to get the table you need.
  • Set a reasonable rocksdb_max_background_jobs value.
  • Do not set target_file_size_base too small (32MB is generally sufficient). The default is 4MB, which is generally too small and creates too many sst files; too many sst files make operations more difficult.
  • Set a rate limiter. Without a rate limiter, compaction very often writes 300~500MB/s on pure flash, which may cause short stalls. In testing with four MyRocks instances, a 40MB/s rate limiter per instance gave pretty stable results (less than 200MB/s peak from iostat). A my.cnf sketch of these settings follows this list.
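
An illustrative my.cnf fragment for the settings in this list. The numbers are placeholders to adapt to your hardware, and rocksdb_rate_limiter_bytes_per_sec is my assumption for the server variable behind the rate limiter recommendation, so verify the names and values against your MyRocks build:

    # Sketch only: block, file and compaction settings from the list above.
    [mysqld]
    rocksdb_block_size=16384                      # default is 4096; larger saves space, costs CPU
    rocksdb_max_open_files=-1                     # avoid the table_cache mutex
    rocksdb_max_background_jobs=8                 # size to the cores you can spare for flush/compaction
    rocksdb_rate_limiter_bytes_per_sec=41943040   # ~40MB/s per instance
    rocksdb_default_cf_options=target_file_size_base=32m;level_compaction_dynamic_level_bytes=true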

Bloom Filter

  • Configure a bloom filter and prefix extractor. A full filter is recommended (the block-based filter does not work for Get() + prefix bloom). The prefix extractor can be configured per column family and uses the first prefix_extractor bytes as the key. If using one BIGINT column as a primary key, the recommended bloom filter size is 12 (the first 4 bytes are for the internal index id + an 8-byte BIGINT).
  • Configure the Memtable bloom filter. The Memtable bloom filter is useful for reducing CPU usage if you see high CPU usage at rocksdb::MemTable::KeyComparator. Its size depends on the Memtable size; set memtable_prefix_bloom_bits=41943040 for a 128MB Memtable (roughly 128MB at ~30 bytes per key ≈ 4M keys, times 10 bits per key). A my.cnf sketch follows this list.
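
A my.cnf-style sketch of these bloom filter settings for a single-BIGINT primary key. prefix_extractor=capped:12 is assumed to be the form MyRocks accepts for a 12-byte prefix, and memtable_prefix_bloom_bits is the older option name used in this post (newer builds expose memtable_prefix_bloom_size_ratio instead), so check which names your version supports:

    # Sketch only: full bloom filter (10 bits/key), 12-byte prefix extractor,
    # and a memtable bloom filter sized for a 128MB memtable.
    [mysqld]
    rocksdb_default_cf_options=prefix_extractor=capped:12;block_based_table_factory={filter_policy=bloomfilter:10:false};memtable_prefix_bloom_bits=41943040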

Cache

  • Do not set block_cache in rocksdb_default_cf_options (block_based_table_factory). If you do provide a block cache size on the default column family, the same cache is NOT reused for all such column families.
  • Consider setting a shared write buffer size (db_write_buffer_size).
  • Consider using compaction_pri=kMinOverlappingRatio to write less on compaction. A my.cnf sketch of these cache settings follows this list.
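
A my.cnf-style sketch of the cache recommendations. rocksdb_block_cache_size and rocksdb_db_write_buffer_size are the server-level variables I believe correspond to these options, and the sizes shown are placeholders to tune for your host:

    # Sketch only: one shared block cache and a shared write buffer budget,
    # configured at the instance level rather than per column family.
    [mysqld]
    rocksdb_block_cache_size=16G
    rocksdb_db_write_buffer_size=4G
    rocksdb_default_cf_options=compaction_pri=kMinOverlappingRatio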

 

Reference Source: https://github.com/facebook/mysql-5.6/wiki/my.cnf-tuning  
