What's a big database? And how do you deal with it?

bernard

I'm curious as to when a database becomes so big that it becomes slow.

If you're Amazon and someone searches for a product among your thousands/millions of products, why do I get results right away, whereas we've all tried to search some public database and had it time out?

What are the underlying things going on here?
 
Optimised indexes, offloading data into separate tables where needed, query optimisation, SSD/HDD speed, raw computing power...

It's something I need to look into - I currently have a table that has grown to over 21 million rows and 15GB, and it's beginning to get sluggish. That's a drop in the ocean for many databases though!
 
There are a lot of things going on with operations like Amazon. From my own experience with SERPWoo and database management - we collect about 100,000,000+ new rows of just SERP data DAILY. It amounts to a little under 100 TBs of data. And we're not a massive operation.

Some simple thoughts on what operations like Amazon can be doing to make things extremely fast:

1. Shards.

We've got Shards setup at SW, where databases are separated onto different servers.

Simply put, let's say keywords that start with A through F are on one server, G through M are on another, and so on for the other letters (this isn't our setup, just an example of sharding). When a user queries the keyword "Big Man Tyrone", that query goes to a different server than "Ryuzaki is wrong".

The time it takes to calculate the location of the server is in nanoseconds, and most likely the location of the server for "Ryuzaki is wrong", if it's a popular query, is saved in an in-memory database like Redis.

In-memory basically means the database is stored in RAM rather than on the hard drive.

Obviously RAM is limited, but it's about 2000x faster to read from than the hard drive. That's why cached data like the server location of "Ryuzaki is wrong" is saved there.

For example: a person queries that keyword, the location of its database is server 12, and the program saves that location into RAM. For the next person to query it, the program doesn't have to take the time to look it up on the hard drive, which is slower than RAM. "Oh that keyword, it's on server 12, we know that because someone else queried it 4 minutes ago."

Now obviously the caches should expire - in 5 minutes, or 5 hours, or whatever depending on your usage - otherwise you'll get stale data hanging around.

That's an example of Sharding.
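Here's a rough Python sketch of that idea - not our actual setup, just the letter-range example from above, with Redis holding the cached shard locations. The server names, the ranges, and the 5 minute expiry are all made up for illustration:

import redis  # assumes a Redis instance running locally for the shard-location cache

# Hypothetical shard map: the first letter of the keyword decides which DB server holds it.
SHARD_RANGES = [
    ("a", "f", "db-server-1"),
    ("g", "m", "db-server-2"),
    ("n", "s", "db-server-3"),
    ("t", "z", "db-server-4"),
]

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def shard_for(keyword: str) -> str:
    """Return the server that holds this keyword, caching the answer in RAM via Redis."""
    cache_key = "shard:" + keyword.lower()
    server = cache.get(cache_key)              # check RAM first
    if server:
        return server                          # someone queried this keyword recently
    first = keyword.lower()[0]
    for start, end, srv in SHARD_RANGES:       # the actual calculation is trivial and fast
        if start <= first <= end:
            cache.set(cache_key, srv, ex=300)  # remember it for 5 minutes, then expire
            return srv
    return "db-server-1"                       # fallback for anything outside a-z

print(shard_for("Big Man Tyrone"))    # routed to db-server-1
print(shard_for("Ryuzaki is wrong"))  # routed to db-server-3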

2. Replicated servers located around the world.

For major operations you'll want servers with replicated copies of the data stored around the world, so users in Japan send their requests to servers closer to them rather than to the other side of the world. That's why providers like DigitalOcean, Linode, and Amazon have data centers around the world. It also serves as a fallback in case one data center goes down.

3. Master/Slave setup.

This setup is when 2 databases contain the exact same data. The Master server is where you write and perform updates. The Slave server is a clone of the Master and is only used to read. When users query a database that has a master/slave setup, they are, unknowingly, only getting data from the slave. This helps the whole operation perform faster, since writing and reading on the same database can be taxing and slow things down.

Now if a user sends a request - like updating their address - the programs will only write to the master, and shortly after, usually within a fraction of a second, the slave server will get that updated data. That's why sometimes when you update something on a website like your bank, it takes a while to propagate around to the slave and other servers.
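A bare-bones sketch of what that read/write split can look like in application code - the host names, table, and columns here are made up, and I'm assuming the mysql-connector-python driver:

import mysql.connector  # assumes the mysql-connector-python package

# Hypothetical hosts: the master takes all writes, a slave/replica serves the reads.
master = mysql.connector.connect(host="db-master", user="app", password="secret", database="shop")
replica = mysql.connector.connect(host="db-slave-1", user="app", password="secret", database="shop")

def update_address(user_id: int, address: str) -> None:
    # Writes always go to the master.
    cur = master.cursor()
    cur.execute("UPDATE users SET address = %s WHERE id = %s", (address, user_id))
    master.commit()

def get_address(user_id: int) -> str:
    # Reads go to the replica; it can lag the master slightly until replication catches up.
    cur = replica.cursor()
    cur.execute("SELECT address FROM users WHERE id = %s", (user_id,))
    row = cur.fetchone()
    return row[0] if row else ""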

4. Caching.

Caching is pretty simple to understand; MySQL and all major databases do this by default. You can play around with buffer sizes and such to increase or reduce how much gets cached, but unless you know what you are doing it's best to leave it to a database expert. So when a user queries "Red Cups" - that might be a new keyword to the system, so it gathers all the data from the rows and displays all the Red Cups within Amazon's system. This could take 1 second.

The next user to query "Red Cups" will get the data back faster because the database has it saved - so 0.0001 seconds.

5. In-Memory Databases like Redis

This is a database that's stored in RAM, so you want to store important stuff that needs to be accessed quickly, or caching data. RAM is limited compared to the hard drive.

For example, the MacBook I'm on has 1 TB (terabyte) of hard drive space, but only 64 GB of RAM. I can store a MySQL database that holds 1 TB on this computer, but Redis only has 64 GB to work with, which means I cannot possibly store all the MySQL data in RAM.

I would have to choose stuff that is constantly being accessed - caching, or user sessions - all stuff that has expiration times in the 5-60 minute range. It's a dance, but once you understand the potential you can go nuts and make any website faster.

Most people using Redis are using it to cache data for fast access.
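The usual pattern is "cache-aside": check Redis first, and only hit the real database when the data isn't in RAM yet. A minimal Python sketch, with a made-up key name, a stand-in database function, and a 10 minute expiry:

import json
import redis  # in-memory store, limited by how much RAM you have

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def run_slow_database_query(term: str) -> list:
    # Stand-in for the real MySQL query that scans the product rows (the ~1 second part).
    return [term + " result " + str(i) for i in range(3)]

def search_products(term: str) -> list:
    """Cache-aside: try Redis first, fall back to the slow database query."""
    key = "search:" + term.lower()
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                # the ~0.0001 second path, straight from RAM
    results = run_slow_database_query(term)      # the slow path, only on a cache miss
    cache.set(key, json.dumps(results), ex=600)  # keep it for 10 minutes, then let it expire
    return results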

--

All this stuff is operating in nano/millisecond times. That's what the 4.8 GHz 10-core CPU stuff is all about: the more GHz, the more commands a server can execute in 1 second.

There is a lot more to caching, in-memory databases, sharding, and replication of servers. There is optimization of databases, optimization of the queries being sent, and even the backup scenarios in case of crashes.

But that's the gist of what it takes to make databases faster.

In the scenario of @Steptoe's 21 million rows on 15GB, that really is a drop in the bucket, and it can possibly be made 10x faster just by understanding and updating the default configuration of the database, without extras. Even simply optimizing the query and making sure you are querying against indexes instead of free-balling it will make the database respond 2x faster.
 
Thanks for the exhaustive answer @CCarter.

Would you then say that, in 2023, if a database takes 10 seconds to load, almost regardless of how much it searches, then that is a design/coding flaw and not a question of money/resources?
 
95 out of 100 times, yes. It depends on what you mean by "load", but let's assume you mean loading and reading the data into a script, a user interface, or a website, like returning Amazon results.

1. INDEX:
It could very simply be that the coder who coded it isn't using an "index" when doing a query. An index is simply a lookup structure over the IDs or columns you are querying against. Adding an index can speed up queries 100x, because the location of the rows that hold this data is saved in the "index".

For example, if I've got a database that has each day of the year stored for the last 50 years, that's 365.25 days x 50 years, so about 18,262.5 (rounding to 18,263) rows of data.

Now I can create an index that stores, sort of like tags, the dates by year in another column. So when I do a query for dates in 2012, instead of reading ALL 18,263 days, I'm going to query "year = 2012", and that immediately reduces the amount of data being searched down to only 366 days instead of 18,263.

That's the gist of why, when creating queries, you should be executing against indexes that you create to sort through everything.

Example: the keyword "Ryuzaki is wrong" has a keyword_id of 667. When storing rows of data that correspond to that keyword, we know what it's associated with by the keyword_id. So when we are going through 100 TBs of data we aren't reading ALL the trillions of rows, but only the rows with that keyword_id - IF we made sure to put an index together on keyword_id.
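As a sketch of what that looks like in practice (the table and column names here are hypothetical, and I'm assuming the mysql-connector-python driver):

import mysql.connector  # assumes the mysql-connector-python package

db = mysql.connector.connect(host="localhost", user="app", password="secret", database="serp")
cur = db.cursor()

# One-time: add an index on keyword_id so MySQL can jump straight to matching rows
# instead of scanning every row in the table.
cur.execute("CREATE INDEX idx_keyword_id ON rankings (keyword_id)")

# Now this only touches the rows tagged with keyword_id 667 ("Ryuzaki is wrong").
cur.execute("SELECT url, serp_position FROM rankings WHERE keyword_id = %s", (667,))
rows = cur.fetchall()

# EXPLAIN tells you whether the index is actually being used (check the 'key' column).
cur.execute("EXPLAIN SELECT url, serp_position FROM rankings WHERE keyword_id = 667")
print(cur.fetchall())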

2. Programming Language:

Another factor is the programming language. MySQL is written in C. I did a test as part of my Building A SaaS journal where I tested writing/reading to MySQL from different languages.

C can write 10,000 times PER SECOND to the database.

C++ can write 1,000 times per second.

PHP and Perl can do 45-75 writes per second.

Python can get almost 100 writes per second.

Go was a lot faster if I recall, but was still under 300 writes per second.

- So in that scenario you can see there are clear winners in terms of speed for executing commands and getting stuff done.
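If you want to sanity-check numbers like that on your own hardware, the test itself is simple - time a loop of inserts and divide. A rough Python version (table name and row count are arbitrary, and the results depend heavily on your driver, commit strategy, and hardware):

import time
import mysql.connector  # assumes the mysql-connector-python package

db = mysql.connector.connect(host="localhost", user="app", password="secret", database="bench")
cur = db.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS writes_test (id INT AUTO_INCREMENT PRIMARY KEY, payload VARCHAR(64))")

N = 1000
start = time.time()
for i in range(N):
    cur.execute("INSERT INTO writes_test (payload) VALUES (%s)", ("row " + str(i),))
    db.commit()  # commit per write, like one write per web request - the slow, honest way
elapsed = time.time() - start

print(round(N / elapsed), "writes per second")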

Now coding in C is a lot more difficult than coding in PHP or Python, but if you need that extra speed you do what you have to do.

Back in the day Facebook coded in PHP and then used machine voodoo (the HipHop compiler) to transform the code into C++, then execute it, because it was faster by leaps and bounds.

So perhaps the programming language itself is slow to respond. And if the query is not optimized on top of that, you're in trouble.

The more expertise you have, the crazier you can get: you can run C code straight from Apache with CGI or FastCGI and get 28 millisecond load times, but then you'd have to code the whole website in C. I mean, let's not get crazy.

Get a database guy to look at the queries being used. In MySQL there is a slow-query log you can enable to see all the queries that are running slow; then you can see whether you can put indexes against them or optimize them better.
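For reference, the slow-query log can be flipped on at runtime with a couple of global settings (shown here from Python, but you can run the same statements from any MySQL client - the 1 second threshold is just an example):

import mysql.connector  # assumes the mysql-connector-python package and a privileged MySQL user

db = mysql.connector.connect(host="localhost", user="root", password="secret")
cur = db.cursor()

# Turn the slow-query log on and log anything that takes longer than 1 second.
cur.execute("SET GLOBAL slow_query_log = 'ON'")
cur.execute("SET GLOBAL long_query_time = 1")

# Check where MySQL is writing the log so you know which file to tail.
cur.execute("SHOW VARIABLES LIKE 'slow_query_log_file'")
print(cur.fetchone())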

--

Bottom line, 10 seconds for anything to load is troubling, especially on the internet in 2023.
 