Efficient MySQL - Selecting the Right Data Types

Choosing the right data types in MySQL is a fundamental step towards building efficient, high-performing databases. While it might seem like a minor detail, selecting optimal data types directly impacts storage requirements, query speed, and overall system responsiveness. This article will guide you through the process of selecting the most appropriate data types for your MySQL tables, helping you unlock better performance and resource utilization.

While PostgreSQL is great, MySQL is often the preferred choice for applications anticipating significant growth, primarily due to solutions like Vitess. Vitess offers a robust database clustering system for MySQL, enabling horizontal scaling that can be more straightforward to implement than scaling PostgreSQL for massive workloads. This future-proofing against the complexities of expanding PostgreSQL, which can involve intricate sharding strategies and performance tuning for large-scale deployments, makes MySQL with Vitess a go-to default for many.

Beyond its scaling advantages with tools like Vitess, MySQL’s inherent simplicity is frequently cited as one of its key strengths. This simplicity can translate to easier setup, management, and a gentler learning curve, particularly for less complex applications or those with read-heavy workloads. While PostgreSQL boasts a rich feature set and advanced capabilities, MySQL’s streamlined nature is often favored for projects where rapid development and ease of use are paramount.

The trending SQLite db per entity/user/tenant can be straightforward for simpler, more isolated datasets. However, this model can lack maturity when dealing with applications requiring complex relationships and interactions between these entity/user/tenants. Managing potentially thousands or millions of individual database files introduces significant overhead in terms of backups, migrations, and ensuring data consistency across what are essentially siloed datasets. For applications where entities frequently interact or share common data, this separation can hinder performance and complicate development, as querying across these distinct databases becomes cumbersome and inefficient compared to operating within a single, unified database structure.

For these reasons I generally turn to MySQL and so in this article I will discuss how to choose your column data types wisely. To optimize MySQL performance when choosing data types, remember these key guidelines:

Smaller is usually better: Opt for the smallest data type that can reliably hold your data. This improves speed by reducing disk, memory, and CPU cache usage, and generally requires fewer CPU cycles. However, don’t underestimate future data range needs, as later changes can be difficult.
Simple is good: Simpler data types require less CPU processing. For instance, comparing integers is more efficient than comparing characters due to complexities like character sets and collations.
Avoid NULL if possible: Specify columns as NOT NULL unless you explicitly need to store NULL values. Nullable columns are harder for MySQL to optimize, use more storage, and require special internal processing, though the performance gain from changing existing schemas is usually small unless they are known to cause issues.

The process involves first determining the general class of type (numeric, string, temporal) and then selecting the specific type, considering its range, precision, storage requirements, and any special behaviors (e.g., DATETIME vs. TIMESTAMP). MySQL also uses aliases for some data types (like INTEGER for INT), which are for compatibility and don’t impact performance; MySQL will report the base type.

Whole Numbers

Whole numbers in MySQL are stored using integer data types: TINYINT (8 bits), SMALLINT (16 bits), MEDIUMINT (24 bits), INT (32 bits), and BIGINT (64 bits). Each type has a specific range of values it can store, determined by its bit size.

The UNSIGNED attribute can be applied to these types, which prevents negative values and roughly doubles the positive value limit (e.g., a TINYINT UNSIGNED stores 0 to 255, while a signed TINYINT stores -128 to 127). Signed and unsigned types use the same storage space and have similar performance.

Although data is stored according to the chosen integer type, most integer computations in MySQL are performed using 64-bit BIGINT integers, with some exceptions like certain aggregate functions that might use DECIMAL or DOUBLE.

Specifying a “width” for an integer type, like INT(11), does not affect the range of values that can be stored or how computations are done. It only influences the display width in MySQL’s interactive tools. For storage and calculations, INT(1) is the same as INT(20).

Real Numbers

Real numbers in MySQL represent numbers with fractional parts and can also store integers too large for BIGINT. MySQL offers exact (DECIMAL) and inexact (FLOAT, DOUBLE) types.

FLOAT and DOUBLE use standard floating-point math for approximate calculations. It’s generally advised to specify the type (FLOAT or DOUBLE) without defining precision, as precision specifiers are nonstandard and can cause MySQL to alter the data type or round values. FLOAT uses 4 bytes of storage, while DOUBLE uses 8 bytes, offering greater precision and range. Internally, MySQL uses DOUBLE for floating-point calculations.

DECIMAL is recommended for exact fractional results, such as financial data, despite its higher storage and computational costs. An alternative for high-volume financial data is to use BIGINT by multiplying currency amounts to store them as integers (e.g., multiplying dollar amounts by a million to store values to the ten-thousandth of a cent), thus avoiding floating-point imprecision and DECIMAL’s cost.

Strings

MySQL offers a variety of string data types, with VARCHAR and CHAR being the two primary ones for storing character values. Each string column can have its own character set and collation (sorting rules). The way these types are stored can vary by storage engine, with InnoDB being a common example.

`VARCHAR`

VARCHAR stores variable-length character strings and is the most frequently used string type. It’s efficient in terms of storage because it only uses the space needed for the actual string, plus 1 or 2 extra bytes to record the string’s length (1 byte if the maximum length is ≤255 bytes, 2 bytes if more). For instance, a VARCHAR(10) using the latin1 character set will use up to 11 bytes.

This space saving can improve performance. However, since rows are variable-length, updates can cause them to grow, potentially leading to extra work like page splitting in InnoDB if the row no longer fits in its original location. VARCHAR is generally best when:

The maximum column length is significantly larger than the average length of the stored strings.
Updates to the field are infrequent, minimizing fragmentation issues.
A complex character set (like UTF-8) is used, where characters themselves have variable byte lengths. InnoDB might store long VARCHAR values as BLOBs.

`CHAR`

CHAR stores fixed-length character strings. MySQL allocates space for the specified number of characters for every CHAR value. When storing, trailing spaces are removed from the value. For comparisons, values are padded with spaces as needed. Upon retrieval, these padding spaces are typically stripped.

CHAR is useful for:

Storing very short strings.
Storing strings that are all nearly the same length (e.g., MD5 hashes).
Data that is frequently changed, as fixed-length rows are less prone to fragmentation. For very short columns, CHAR can be more efficient. For example, a CHAR(1) (for ‘Y’ or ‘N’) uses 1 byte in a single-byte character set, whereas VARCHAR(1) would use 2 bytes due to the length byte.

`BINARY` and `VARBINARY`

The sibling types for CHAR and VARCHAR are BINARY and VARBINARY, respectively. These store binary strings (byte strings) rather than character strings. MySQL pads BINARY values with \0 (the zero byte) instead of spaces, and this padding is not stripped on retrieval. These types are useful for storing binary data where byte-by-byte comparison is desired, which can be faster and simpler than character comparisons as it’s based on numeric byte values.

Efficient Space Allocation

It’s important to allocate only as much space as genuinely needed for string columns. Even if storing ‘hello’ takes the same disk space in a VARCHAR(5) and a VARCHAR(200), the larger column can consume significantly more memory. MySQL often allocates fixed-size memory chunks for internal operations like sorting or using temporary tables (in-memory or on-disk). Using excessively large column definitions can lead to inefficient memory use, negatively impacting performance.

`BLOB` and `TEXT` types

BLOB and TEXT are string data types in MySQL used for storing large amounts of binary data (BLOB) and character data (TEXT), respectively. Each is a family of types (e.g., TINYTEXT, TEXT, LONGTEXT and TINYBLOB, BLOB, LONGBLOB), with BLOB being a synonym for SMALLBLOB, and TEXT for SMALLTEXT.

MySQL treats each BLOB or TEXT value as an individual object, often storing large values externally to the row, requiring 1 to 4 bytes in the row itself plus the external storage for the data. The primary distinction between them is that BLOB types store binary data without a character set or collation, while TEXT types have an associated character set and collation.

Sorting BLOB and TEXT columns is unique: MySQL only sorts based on the first max_sort_length bytes, not the entire string. Furthermore, MySQL cannot index the full length of these data types and cannot use these indexes for sorting operations. It’s generally advised not to store large objects like images directly in the database as BLOBs due to performance issues with large data volumes, such as slower schema changes. Instead, it’s better to store them in a separate object data store and record their location or filename in the database.

Using `ENUM` as a string alternative

Using an ENUM column in MySQL instead of a traditional string type allows you to store a predefined set of distinct string values. MySQL stores these values very compactly, as integers (1 or 2 bytes depending on the number of options), which represent the position of the string in the defined list.

Key Points:

Storage: ENUMs store integer representations of strings, saving space. For example, ENUM('fish', 'apple', 'dog') would store ‘fish’ as 1, ‘apple’ as 2, and ‘dog’ as 3 (based on their definition order, as seen when retrieving e+0).
Dual Nature: Retrieving an ENUM in a numeric context reveals its integer value (e.g., SELECT e + 0 FROM enum_test;). It’s advised not to use numbers as ENUM constants (e.g., ENUM('1', '2', '3')) as this can be confusing.
Sorting: ENUM fields sort by their internal integer values, not alphabetically by the string values. To sort alphabetically, you can define the ENUM members in alphabetical order or use FIELD() in your query, though the latter might prevent index use for sorting.
Overhead: While compact, ENUMs have overhead because MySQL needs to look up the string representation for the stored integer.
Join Performance: Joining ENUM columns to other ENUM columns can be faster (e.g., 3.5 queries/sec in the provided benchmark compared to 2.6 for VARCHAR to VARCHAR). However, joining ENUM columns to CHAR or VARCHAR columns can be slower (1.7 or 1.8 queries/sec in the benchmark).
Table Size Reduction: Converting VARCHAR columns to ENUM can significantly reduce table size (e.g., about one-third smaller in the example). This also shrinks the primary key, which in InnoDB tables, can make other indexes on the table much smaller too.
Schema Changes: A significant drawback is that any change to the list of valid ENUM values requires a schema modification (ALTER TABLE). This can be a major inconvenience if the allowed values change frequently, especially without a robust system for automated schema changes. The text also warns against an anti-pattern of using “too many ENUMs.”

In summary, ENUMs offer space efficiency and can improve performance in specific join scenarios (ENUM to ENUM). However, they introduce complexities with sorting, potential performance hits when joining with string types, and operational challenges due to the need for schema changes when modifying the list of allowed values.

Date and Time

MySQL offers various data types for dates and times, with microsecond precision being the finest. While types like YEAR and DATE are straightforward, storing combined date and time presents a choice between DATETIME and TIMESTAMP.

`DATETIME`

Stores values from year 1000 to 9999 with microsecond precision.
Uses 8 bytes of storage.
Stores date and time as an integer (YYYYMMDDHHMMSS format), independent of time zone.
Displays in a standard, sortable format (e.g., 2008-01-16 22:37:08).

`TIMESTAMP`

Stores the number of seconds since midnight, January 1, 1970, GMT (Unix timestamp).
Uses 4 bytes of storage, leading to a smaller range (1970 to January 19, 2038).
Its display depends on the time zone settings of the server, OS, and client. This means the same stored TIMESTAMP can display differently based on time zone, unlike DATETIME which preserves the textual representation.
Has special properties: by default, the first TIMESTAMP column is set to the current time on insertion and updated on row modification unless specified otherwise.
Is NOT NULL by default.

Storing Date and Time as an Integer

An increasingly popular alternative is storing the date and time as a Unix epoch (seconds since January 1, 1970, UTC) in an integer column.

A signed 32-bit INT supports dates until 2038.
An unsigned 32-bit INT supports dates until 2106.
A 64-bit integer supports dates much further. This method avoids MySQL’s time zone complexities and can be more space-efficient than DATETIME but may shift handling to application code.

Key Considerations for Choosing:

Date Range: How far into the past or future do you need to represent? DATETIME has a much larger range.
Storage Space: TIMESTAMP uses less space (4 bytes vs. 8 bytes for DATETIME).
Fractional Seconds: Both can support microsecond precision.
Time Zone Handling: TIMESTAMP is time zone aware, which can be beneficial or problematic depending on the use case. DATETIME is time zone naive. Storing as an integer in UTC is another way to manage time zones consistently.
Special Behaviors: TIMESTAMP’s auto-initialization and auto-update features can be useful.

Ultimately, the best choice depends on specific application needs, and storing time as an integer is a viable alternative to consider.

Bit-Packed

MySQL offers several ways to store data compactly using individual bits, though these are technically string types.

`BIT`

Used to store one or more true/false values. BIT(N) stores N bits (up to 64). However, InnoDB stores BIT columns as the smallest integer type large enough, so space savings aren’t guaranteed. MySQL treats BIT as a string type. Retrieving a BIT(1) yields a binary string “0” or “1”. In a numeric context, it converts to the corresponding number. For example, b'00111001' (binary for 57) stored in a BIT(8) column retrieves as the character “9” (ASCII 57) in a string context, but as the number 57 in a numeric context (e.g., a + 0). This can be confusing, and caution is advised; avoiding this type is often better.

`CHAR(0)`

An alternative for a single true/false value is a nullable CHAR(0) column, storing NULL or an empty string. This works but can be unclear for users and complicate queries. TINYINT is generally recommended instead unless space is extremely critical.

`SET`

Suitable for storing many true/false values by combining them into one column. MySQL internally represents this as a packed set of bits, making it storage-efficient. Functions like FIND_IN_SET() simplify its use in queries.

In essence, the seemingly small decision of picking a TINYINT over an INT, or a VARCHAR with an appropriate length over a TEXT type, accumulates across your database, significantly influencing its agility and cost-effectiveness. By investing the time to understand and implement optimal data types, you’re laying the groundwork for a high-performing, scalable, and reliable MySQL database that can effectively support your application’s needs now and in the future.