
Efficient MySQL - Selecting the Right Data Types
Choosing the right data types in MySQL is a fundamental step towards building efficient, high-performing databases. While it might seem like a minor detail, selecting optimal data types directly impacts storage requirements, query speed, and overall system responsiveness. This article will guide you through the process of selecting the most appropriate data types for your MySQL tables, helping you unlock better performance and resource utilization.
While PostgreSQL is great, MySQL is often the preferred choice for applications anticipating significant growth, primarily due to solutions like Vitess. Vitess offers a robust database clustering system for MySQL, enabling horizontal scaling that can be more straightforward to implement than scaling PostgreSQL for massive workloads. This future-proofing against the complexities of expanding PostgreSQL, which can involve intricate sharding strategies and performance tuning for large-scale deployments, makes MySQL with Vitess a go-to default for many.
Beyond its scaling advantages with tools like Vitess, MySQL’s inherent simplicity is frequently cited as one of its key strengths. This simplicity can translate to easier setup, management, and a gentler learning curve, particularly for less complex applications or those with read-heavy workloads. While PostgreSQL boasts a rich feature set and advanced capabilities, MySQL’s streamlined nature is often favored for projects where rapid development and ease of use are paramount.
The trending SQLite db per entity/user/tenant can be straightforward for simpler, more isolated datasets. However, this model can lack maturity when dealing with applications requiring complex relationships and interactions between these entity/user/tenants. Managing potentially thousands or millions of individual database files introduces significant overhead in terms of backups, migrations, and ensuring data consistency across what are essentially siloed datasets. For applications where entities frequently interact or share common data, this separation can hinder performance and complicate development, as querying across these distinct databases becomes cumbersome and inefficient compared to operating within a single, unified database structure.
For these reasons I generally turn to MySQL and so in this article I will discuss how to choose your column data types wisely. To optimize MySQL performance when choosing data types, remember these key guidelines:
- Smaller is usually better: Opt for the smallest data type that can reliably hold your data. This improves speed by reducing disk, memory, and CPU cache usage, and generally requires fewer CPU cycles. However, don’t underestimate future data range needs, as later changes can be difficult.
- Simple is good: Simpler data types require less CPU processing. For instance, comparing integers is more efficient than comparing characters due to complexities like character sets and collations.
- Avoid NULL if possible: Specify columns as
NOT NULL
unless you explicitly need to storeNULL
values. Nullable columns are harder for MySQL to optimize, use more storage, and require special internal processing, though the performance gain from changing existing schemas is usually small unless they are known to cause issues.
The process involves first determining the general class of type (numeric, string, temporal) and then selecting the specific type, considering its range, precision, storage requirements, and any special behaviors (e.g., DATETIME
vs. TIMESTAMP
). MySQL also uses aliases for some data types (like INTEGER
for INT
), which are for compatibility and don’t impact performance; MySQL will report the base type.
Whole Numbers
Whole numbers in MySQL are stored using integer data types: TINYINT
(8 bits), SMALLINT
(16 bits), MEDIUMINT
(24 bits), INT
(32 bits), and BIGINT
(64 bits). Each type has a specific range of values it can store, determined by its bit size.
The UNSIGNED
attribute can be applied to these types, which prevents negative values and roughly doubles the positive value limit (e.g., a TINYINT UNSIGNED
stores 0 to 255, while a signed TINYINT
stores -128 to 127). Signed and unsigned types use the same storage space and have similar performance.
Although data is stored according to the chosen integer type, most integer computations in MySQL are performed using 64-bit BIGINT
integers, with some exceptions like certain aggregate functions that might use DECIMAL
or DOUBLE
.
Specifying a “width” for an integer type, like INT(11)
, does not affect the range of values that can be stored or how computations are done. It only influences the display width in MySQL’s interactive tools. For storage and calculations, INT(1)
is the same as INT(20)
.
Real Numbers
Real numbers in MySQL represent numbers with fractional parts and can also store integers too large for BIGINT
. MySQL offers exact (DECIMAL
) and inexact (FLOAT
, DOUBLE
) types.
FLOAT
and DOUBLE
use standard floating-point math for approximate calculations. It’s generally advised to specify the type (FLOAT
or DOUBLE
) without defining precision, as precision specifiers are nonstandard and can cause MySQL to alter the data type or round values. FLOAT
uses 4 bytes of storage, while DOUBLE
uses 8 bytes, offering greater precision and range. Internally, MySQL uses DOUBLE
for floating-point calculations.
DECIMAL
is recommended for exact fractional results, such as financial data, despite its higher storage and computational costs. An alternative for high-volume financial data is to use BIGINT
by multiplying currency amounts to store them as integers (e.g., multiplying dollar amounts by a million to store values to the ten-thousandth of a cent), thus avoiding floating-point imprecision and DECIMAL
’s cost.
Strings
MySQL offers a variety of string data types, with VARCHAR
and CHAR
being the two primary ones for storing character values. Each string column can have its own character set and collation (sorting rules). The way these types are stored can vary by storage engine, with InnoDB being a common example.
VARCHAR
VARCHAR
stores variable-length character strings and is the most frequently used string type. It’s efficient in terms of storage because it only uses the space needed for the actual string, plus 1 or 2 extra bytes to record the string’s length (1 byte if the maximum length is ≤255 bytes, 2 bytes if more). For instance, a VARCHAR(10)
using the latin1
character set will use up to 11 bytes.
This space saving can improve performance. However, since rows are variable-length, updates can cause them to grow, potentially leading to extra work like page splitting in InnoDB if the row no longer fits in its original location. VARCHAR
is generally best when:
- The maximum column length is significantly larger than the average length of the stored strings.
- Updates to the field are infrequent, minimizing fragmentation issues.
- A complex character set (like UTF-8) is used, where characters themselves have variable byte lengths. InnoDB might store long
VARCHAR
values asBLOB
s.
CHAR
CHAR
stores fixed-length character strings. MySQL allocates space for the specified number of characters for every CHAR
value. When storing, trailing spaces are removed from the value. For comparisons, values are padded with spaces as needed. Upon retrieval, these padding spaces are typically stripped.
CHAR
is useful for:
- Storing very short strings.
- Storing strings that are all nearly the same length (e.g., MD5 hashes).
- Data that is frequently changed, as fixed-length rows are less prone to fragmentation. For very short columns,
CHAR
can be more efficient. For example, aCHAR(1)
(for ‘Y’ or ‘N’) uses 1 byte in a single-byte character set, whereasVARCHAR(1)
would use 2 bytes due to the length byte.
BINARY
and VARBINARY
The sibling types for CHAR
and VARCHAR
are BINARY
and VARBINARY
, respectively. These store binary strings (byte strings) rather than character strings. MySQL pads BINARY
values with \0
(the zero byte) instead of spaces, and this padding is not stripped on retrieval. These types are useful for storing binary data where byte-by-byte comparison is desired, which can be faster and simpler than character comparisons as it’s based on numeric byte values.
Efficient Space Allocation
It’s important to allocate only as much space as genuinely needed for string columns. Even if storing ‘hello’ takes the same disk space in a VARCHAR(5)
and a VARCHAR(200)
, the larger column can consume significantly more memory. MySQL often allocates fixed-size memory chunks for internal operations like sorting or using temporary tables (in-memory or on-disk). Using excessively large column definitions can lead to inefficient memory use, negatively impacting performance.
BLOB
and TEXT
types
BLOB
and TEXT
are string data types in MySQL used for storing large amounts of binary data (BLOB
) and character data (TEXT
), respectively. Each is a family of types (e.g., TINYTEXT
, TEXT
, LONGTEXT
and TINYBLOB
, BLOB
, LONGBLOB
), with BLOB
being a synonym for SMALLBLOB
, and TEXT
for SMALLTEXT
.
MySQL treats each BLOB
or TEXT
value as an individual object, often storing large values externally to the row, requiring 1 to 4 bytes in the row itself plus the external storage for the data. The primary distinction between them is that BLOB
types store binary data without a character set or collation, while TEXT
types have an associated character set and collation.
Sorting BLOB
and TEXT
columns is unique: MySQL only sorts based on the first max_sort_length
bytes, not the entire string. Furthermore, MySQL cannot index the full length of these data types and cannot use these indexes for sorting operations. It’s generally advised not to store large objects like images directly in the database as BLOB
s due to performance issues with large data volumes, such as slower schema changes. Instead, it’s better to store them in a separate object data store and record their location or filename in the database.
Using ENUM
as a string alternative
Using an ENUM
column in MySQL instead of a traditional string type allows you to store a predefined set of distinct string values. MySQL stores these values very compactly, as integers (1 or 2 bytes depending on the number of options), which represent the position of the string in the defined list.
Key Points:
- Storage:
ENUM
s store integer representations of strings, saving space. For example,ENUM('fish', 'apple', 'dog')
would store ‘fish’ as 1, ‘apple’ as 2, and ‘dog’ as 3 (based on their definition order, as seen when retrievinge+0
). - Dual Nature: Retrieving an
ENUM
in a numeric context reveals its integer value (e.g.,SELECT e + 0 FROM enum_test;
). It’s advised not to use numbers asENUM
constants (e.g.,ENUM('1', '2', '3')
) as this can be confusing. - Sorting:
ENUM
fields sort by their internal integer values, not alphabetically by the string values. To sort alphabetically, you can define theENUM
members in alphabetical order or useFIELD()
in your query, though the latter might prevent index use for sorting. - Overhead: While compact,
ENUM
s have overhead because MySQL needs to look up the string representation for the stored integer. - Join Performance: Joining
ENUM
columns to otherENUM
columns can be faster (e.g., 3.5 queries/sec in the provided benchmark compared to 2.6 forVARCHAR
toVARCHAR
). However, joiningENUM
columns toCHAR
orVARCHAR
columns can be slower (1.7 or 1.8 queries/sec in the benchmark). - Table Size Reduction: Converting
VARCHAR
columns toENUM
can significantly reduce table size (e.g., about one-third smaller in the example). This also shrinks the primary key, which in InnoDB tables, can make other indexes on the table much smaller too. - Schema Changes: A significant drawback is that any change to the list of valid
ENUM
values requires a schema modification (ALTER TABLE
). This can be a major inconvenience if the allowed values change frequently, especially without a robust system for automated schema changes. The text also warns against an anti-pattern of using “too manyENUM
s.”
In summary, ENUM
s offer space efficiency and can improve performance in specific join scenarios (ENUM
to ENUM
). However, they introduce complexities with sorting, potential performance hits when joining with string types, and operational challenges due to the need for schema changes when modifying the list of allowed values.
Date and Time
MySQL offers various data types for dates and times, with microsecond precision being the finest. While types like YEAR
and DATE
are straightforward, storing combined date and time presents a choice between DATETIME
and TIMESTAMP
.
DATETIME
- Stores values from year 1000 to 9999 with microsecond precision.
- Uses 8 bytes of storage.
- Stores date and time as an integer (YYYYMMDDHHMMSS format), independent of time zone.
- Displays in a standard, sortable format (e.g.,
2008-01-16 22:37:08
).
TIMESTAMP
- Stores the number of seconds since midnight, January 1, 1970, GMT (Unix timestamp).
- Uses 4 bytes of storage, leading to a smaller range (1970 to January 19, 2038).
- Its display depends on the time zone settings of the server, OS, and client. This means the same stored
TIMESTAMP
can display differently based on time zone, unlikeDATETIME
which preserves the textual representation. - Has special properties: by default, the first
TIMESTAMP
column is set to the current time on insertion and updated on row modification unless specified otherwise. - Is
NOT NULL
by default.
Storing Date and Time as an Integer
An increasingly popular alternative is storing the date and time as a Unix epoch (seconds since January 1, 1970, UTC) in an integer column.
- A signed 32-bit
INT
supports dates until 2038. - An unsigned 32-bit
INT
supports dates until 2106. - A 64-bit integer supports dates much further. This method avoids MySQL’s time zone complexities and can be more space-efficient than
DATETIME
but may shift handling to application code.
Key Considerations for Choosing:
- Date Range: How far into the past or future do you need to represent?
DATETIME
has a much larger range. - Storage Space:
TIMESTAMP
uses less space (4 bytes vs. 8 bytes forDATETIME
). - Fractional Seconds: Both can support microsecond precision.
- Time Zone Handling:
TIMESTAMP
is time zone aware, which can be beneficial or problematic depending on the use case.DATETIME
is time zone naive. Storing as an integer in UTC is another way to manage time zones consistently. - Special Behaviors:
TIMESTAMP
’s auto-initialization and auto-update features can be useful.
Ultimately, the best choice depends on specific application needs, and storing time as an integer is a viable alternative to consider.
Bit-Packed
MySQL offers several ways to store data compactly using individual bits, though these are technically string types.
BIT
Used to store one or more true/false values. BIT(N)
stores N bits (up to 64). However, InnoDB stores BIT
columns as the smallest integer type large enough, so space savings aren’t guaranteed. MySQL treats BIT
as a string type. Retrieving a BIT(1)
yields a binary string “0” or “1”. In a numeric context, it converts to the corresponding number. For example, b'00111001'
(binary for 57) stored in a BIT(8)
column retrieves as the character “9” (ASCII 57) in a string context, but as the number 57 in a numeric context (e.g., a + 0
). This can be confusing, and caution is advised; avoiding this type is often better.
CHAR(0)
An alternative for a single true/false value is a nullable CHAR(0)
column, storing NULL
or an empty string. This works but can be unclear for users and complicate queries. TINYINT
is generally recommended instead unless space is extremely critical.
SET
Suitable for storing many true/false values by combining them into one column. MySQL internally represents this as a packed set of bits, making it storage-efficient. Functions like FIND_IN_SET()
simplify its use in queries.
In essence, the seemingly small decision of picking a TINYINT
over an INT
, or a VARCHAR
with an appropriate length over a TEXT
type, accumulates across your database, significantly influencing its agility and cost-effectiveness. By investing the time to understand and implement optimal data types, you’re laying the groundwork for a high-performing, scalable, and reliable MySQL database that can effectively support your application’s needs now and in the future.