|
ABOUT THE AUTHORS
Jeff Garbus’s
background
includes a bachelor’s degree from Rensselaer Polytechnic
Institute, and work experience from PCs to mainframes and back
again. He has many years of Microsoft and Sybase SQL Server
experience, with a special emphasis on assisting clients in
migrating from existing systems to pilot and large-scale
projects. He is well known in the industry, having spoken at
user conferences and to user groups for many years, written
articles and columns for many magazines nationally and
internationally, as well as having written several books.
Recently his focus has been on very large databases, data
warehousing, training, and remote database administration. Jeff
has been in consulting for fourteen years, training for nine,
and in the software business for eighteen. He has a demonstrated
talent for staying at the leading edge of technology, as well as
transferring his knowledge to others. Jeff currently is CEO of
Soaring Eagle Consulting, an RDBMS consulting and training firm
based in Tampa, which specializes in solving business problems
as well as performance problems. Soaring Eagle now offers remote
database administration services.
Ashish Gupta is
currently a database administrator with Mizuho Securities in
Japan. He has more than nine years of database administration
experience with various global investment banks and five years
of software design and development experience. He received a
bachelor of science degree in computer science from the Indian
Institute of Technology, Roorkee, India.
You may read the
rest of this book by ordering online on
AMAZON.com or on the Publishers site
WORDWARE.com
|
WHAT IS DATA PARTITIONING?
Data partitioning breaks up large tables and
indexes into smaller pieces that can reside on separate
partitions, allowing DBAs to have a very fine level of control
over data placement, maintenance, and management.
A
segment
is a portion of a
device that is defined within ASE. It is used for the storage of
specific types of data such as system data, log data, and the
data itself. Partitions can be placed on individual segments and
multiple partitions can be placed on a single segment. In turn,
a segment or segments can be placed on any logical or physical
device, thus isolating I/O and aiding performance and data
availability.
Data in a
table or index on one partition can be managed and processed
separately from data in other partitions. Queries only need to
access partitions that contain the required data.
DBAs can
perform management and maintenance tasks much more quickly on
individual smaller partitions rather than on huge tables and
indexes. To save even more time, some common tasks can be run in
parallel on multiple partitions. Additionally, a DBAcan automate
tasks on partitions. As the size of the data grows, more
partitions can be added.
There are four methods of data partitioning
offered in ASE 15. The first is called round- obin partitioning,
the only method used in pre-ASE 15 versions. This method places
data sequentially on partitions. There is no way to specify
which data goes on which partition, and all partitions will be
involved in query processing.
The next three methods are referred to as
“semantic data partitioning” methods because the placement of
data on partitions can be specified. They are range, list,
and hash partitioning.
The partition method expected to be most widely
used in ASE 15 is range partitioning. With this method,
ranges of values can be specified to be placed on a given
partition.
With list partitioning, individual values can be
placed on separate partitions.
In hash partitioning, the data is placed on
partitions based on the columns specified and and internal
hashing algorithm. |
|
WHY USE DATA PARTITIONING?
By using data partitioning in ASE 15, businesses lower the cost
of keeping and using databases of any size on ASE, while
improving the performance of their applications and making their
data more available. |
|
REDUCING THE COST OF MANAGING and
MAINTAINING DATABASES
The most costly aspect of operating a database
of any size is the time and effort required to manage and
maintain it. Databases can become so large that there simply
isn’t enough time to perform all of the necessary maintenance
tasks. This not only has a direct impact on application
performance but also puts data in danger by making backups less
practical due to the time required.
By making the use of various utilities much more
efficient, data partitioning in ASE 15 has a positive impact on
the time and resources spent maintaining and managing large
databases.
Imagine that multiple activities are happening
on a large table. A large complex DSS-style query is running,
OLTP processing is occurring, bcp is being used to load data,
and update statistics is running to freshen statistics. In an
unpartitioned table, the activities under way are enough to slow
operations to an unacceptable level if not bring many of them to
a halt. Now imagine the same table, but this time it is data
partitioned. Each of the operations that are running are doing
so on separate partitions without affecting the others.
By partitioning a table or index into smaller
pieces, DBAs can run utilities on a per-artition basis. This
results in the utilities running faster, allowing other
operations to work efficiently on data in other partitions and
ensuring that the bulk of the data in the table is available for
applications. In this way, the use of data partitioning in ASE
15 increases the operational scalability of applications using
VLDBs.
With partitioned tables and indexes, a DBA can
schedule maintenance tasks and cycle through one partition of a
table at a time, or perform tasks simultaneously on multiple
partitions if preferred. If maintenance on a given partition is
not practical at a certain time, that partition can be skipped
and returned to later. The maintenance schedule can also be
tightly integrated with ASE’s Job Scheduler, thus freeing up the
DBA for other important activities.
Two commonly run utilities are also two of the
most resource and time consuming — update statistics and reorg.
Both need to be run on tables and indexes on a regular basis in
order to assure good performance from queries.
The update statistics utility gathers statistics
about the table, its indexes, and the distribution of data in
its columns. This information is needed by the query processing
engine’s optimizer to determine the most efficient way to access
the required data. The reorg rebuild utility is used to reclaim
empty space in tables and indexes. This operation is important
for space management and for efficient query performance. Some
tables are so large that it is virtually impossible to run
either of these two important utilities without exceeding the
allotted time and impeding regular business operations.
As an example, many databases contain a
date/time column used when recording a transaction. It’s
important to keep the distribution statistics for such columns
up to date so that queries run as efficiently as possible. But
to update statistics for an unpartitioned table and index, the
entire column has to be read. With huge tables, this can have a
major impact on performance and data availability.
The same operations on a partitioned table and
index will take considerably less time and have minimal impact
on overall performance and data availability. This is because
the distribution statistics only have to be gathered from the
partition that contains the most recent date/time values. The
same applies to the reorg utility. When run on a table and index
that are partitioned, only the target partition will be
affected, keeping the rest of the table and index free for use.
Another time-consuming but necessary task is the
running of common table and index diagnostic utilities such as
the DBCCs that check for data consistency. Running these
diagnostics on a per-partition basis makes this critical task
more efficient.
Still another time- and resource-intensive
operation is the archiving off or deletion of large amounts of
data that are no longer needed in the table. The bcp (bulk copy
program) can be run on individual partitions, allowing data to
be put into or taken out of a table without interfering with
other operations that are using that same table. Only the
partition whose data is being imported or exported via bcp will
be affected. By partitioning tables, data that “ages out” can be
managed more easily. bcp can also run in parallel on multiple
partitions simultaneously.
The truncate table..partition command is used to
perform mass deletions of data from a table. Often this is data
copied off of the table via bcp or other backup methods. In the
past, truncating the table deleted all data from it and could
take a long time. Now, however, data can be deleted on a
per-partition basis. |
|
DATA AVAILABILITY
The
availability of data to applications is vital for the
functioning of any business. If applications can’t get to the
data they require, work doesn’t get done.
As we have
seen, the purpose of partitioning tables and indexes is to
divide them into separate and smaller pieces so that operations
on each piece can be isolated from other operations on other
pieces, thus ensuring that more data can be accessed by
applications more of the time.
With a
partitioned index, the reorg utility can be running on partition
2 while statistics on partition 3 are being updated. Either of
these utilities would make the index unavailable on an
unpartitioned index. In this case, both are running
simultaneously while data on partition 1 is still available to
an application.
Similarly, if queries are changing the data,
indexes can be updated on one partition without affecting
operations on others.
In a mixed workload environment, DBAs can design
partitions so that transactional operations occur on one
partition while DSS queries run on others.
The data availability that partitioning offers
is critical to the performance of applications running in a
mixed workload environment.
Partitions also enhance availability by being
placed on various physical devices. If one device fails, the
remaining partitions will still be available for use.
INDEX
PARTITIONING
There are two methods of index partitioning:
global and local. A global index has its entire structure (one
B-tree) on one partition. When queries on partitioned tables use
a global index, they will follow pointers from the index to the
data on the table’s data partition(s). The index structure
covers all data partitions of the table.
The second type of index is a local index. Here
the index’s structure is broken up into pieces (multiple
B-trees), which are stored together on a single partition. Since
the partitioned index has to be based on the same column order
of values as the data partitions, a local index is similar to
having smaller
individual
indexes pointing to data in corresponding data partitions. If a
query is searching for only a fraction of the data in the table,
it will only have to read one piece of the index. This increases
concurrency by allowing different queries to simultaneously
access different pieces of the same index. Aquery that is
retrieving data from one partition will not interfere with a
query that is using another partition. The result is less
blocking, less I/O, more data availability, and higher
performance. Another advantage to using partitioned indexes is
that the size of each is based on the number of rows in its
corresponding table partition. Smaller indexes take less time to
scan. |
|
HOW TO
AND WHEN TO USE DATA PARTITIONING
Partitioning tables and indexes is relatively
simple. Atable can be partitioned at its creationor later. Once
partitioned, data put into the table will follow the rules
set by the partitioning method, whether it’s done with an insert
or update or by using bcp to import data. An index can also be
partitioned when it’s created, but an existing index cannot be
partitioned. It will need to be dropped and recreated with
partitioning.
Here are some situations where DBAs might
consider using the three semantic methods of partitioning
described earlier:
Range
partitioning
— This method allows DBAs to specify what data will be
contained in each partition based on a range of values in
the key column. For example, the values 1, 2, 3, and 4 could
be on partition 1, while 5, 6, 7, and 8 are on partition 2,
and so on. Below is an example of the syntax used to create
a table that uses range partitioning on adate/time column:
-
create table customer (ord_date
datetime not null,
-
name varchar(20) not null,
-
address varchar(40) not null,
other columns ...)
partition by range
(ord_date)
(ord_date1 values <= (3/31/05) on
segment1,
ord_date2 values <= (6/30/05) on
segment2,
ord_date3 values <= (9/30/05) on
segment3,
ord_date4 values <= (12/31/05) on
segment4)
Range partitioning is especially useful for
tables with constant updates, inserts, and deletes that
contain a column or columns with sequential data in them
such as a customer ID or a transaction date. Such tables
require the most attention from DBAs for maintenance and
management. These tables are also commonly used in decision
support style queries, making them excellent candidates for
range partitioning. While transactions are occurring
on a single partition, DBA tasks and DSS queries can access
other partitions. At the same time, it is critical that the
DBAkeep the distribution statistics up to date on the active
transactional partition so that new data can be described to
the query processing engine. As we have seen, the time it
takes to update the statistics is considerably shorter when
it only has to be run on a single partition.
List
partitioning—List
partitioning is similar to range partitioning, but here the
actual values to be placed on a partition are specified.
List partitioning is useful for controlling where specific
values are stored, even if the column itself is not sorted,
and in cases where the order of values in the partition
isnot important. Below is an example of the syntax used to
create a table with list partitioning:
null, regionkey varchar(30) not
null, comment varchar(152) not null)
on segment 1
partition by list
(n_regionkey)
(region1 values ('Americas'),
region2 values ('Asia'),
region3 values ('Europe'),
region4 values ('Australia',
'Other'))
Hash
partitioning—This
method distributes values to partitions based on the
column(s) specified and an internal hashing mechanism. There
is no need to specify a list or range of values. If
the column key contains unique values or values that have
little duplication, hash partitioning will balance the data
across all the partitions. However, if there is extensive
duplication of values, the partitions can become skewed,
with some containing more rows than others. Hash partitions
are useful when many partitions are desired for a largetable
and the values in the key column are not in a sorted order.
They also help equality searches done by the query
processing engine run more efficiently.
not null, l_suppkey integer
not null, l_linenumber integer not null,
l_quantity double not null,
l_extendedprice double not null, other
columns ...)
partition by hash
(l_orderkey,
l_linenumber)
-
(litem_hash1 on segment1,
-
litem_hash2 on segment2,
-
litem_hash3 on segment3,
-
litem_hash4 on segment4)
SUMMARY
If you have a huge amount of data and need to
break it up to manage I/O issues or to manage a smaller subset
of data, partitions are for you. ASE 15 introduces three methods
of partitioning: range, list, and hash. |
|
|
|