 |
|
 |
|
Next: Excellent peopleSoft HRMS Consultant Available
|
| Author |
Message |
External

Since: Jan 15, 2008 Posts: 527
|
(Msg. 16) Posted: Fri Aug 29, 2008 7:47 am
Post subject: Re: sequential disk read speed [Login to view extended thread Info.] Archived from groups: comp>databases>theory (more info?)
|
|
|
"David BL" wrote in message
>On Aug 28, 9:34 pm, "Brian Selzer" wrote:
>> "David BL" wrote in message
>>
>>
>>
>>
>>
>>
>>
>> > On Aug 28, 10:47 am, "Brian Selzer" wrote:
>> >> "David BL" wrote in message
>>
>> >>
>>
>>
>> >> >> , and since there are four disks, the average seek
>> >> >> time for the disk subsystem is reduced to a quarter of that or
>> >> >> roughly
>> >> >> .625ms.
>>
>> >> > In order for the effective seek time to be reduced to a quarter the
>> >> > seeking must be independent. To achieve that I think the striping
>> >> > would need to be very coarse (eg 512kb or 1Mb).
>>
>> >> Drives that support disconnection or some other command queueing
>> >> mechanism
>> >> are all that is needed for seeking to be independent.
>>
>> > If stripes are somewhat smaller than the DBMS block size, then every
>> > drive (in the RAID 0) will be involved in the reading of each and
>> > every DBMS block. No matter how you order those reads, each drive
>> > needs to read a large amount of scattered data and the head will seek
>> > around a lot. If that is the case then the only advantage arises
>> > from your previously mentioned reduction in the overall range of
>> > tracks over which the data resides on a given platter.
>>
>> > Alternatively if the stripe size is larger then each drive will read a
>> > somewhat independent set of the DBMS blocks, and the effective seek
>> > time can be reduced assuming the DBMS is able to issue overlapping
>> > read requests for the DBMS blocks.
>>
>> Your argument rests on the assumption that data is randomly distributed
>> in
>> the stripes on the disk and doesn't take into account the fact that a
>> high-end caching controller eliminates latency by reading an entire track
>> at
>> once. Isn't it true that there is a physical affinity between related
>> data?
>> Isn't it more likely that an index will occupy contiguous stripes than
>> some
>> random set--regardless of stripe size? Can you show that the number of
>> tracks accessed by say, 128 coarse stripe reads is any less than the
>> number
>> of tracks accessed by 1024 fine stripe reads?
>
>Yes, sometimes the DBMS manages to cluster all the necessary data so
>there is very little seeking required, and in that case it won’t
>matter what stripe size is used.
>
>However, that is not always possible. For example consider a B+Tree
>on 1 billion records and in a short period of time the DBMS needs to
>read 100 records for given index values that are effectively at random
>with respect to the ordering on that data type. To keep it simple
>ignore the reading of the internal nodes of the B+Tree. Typically
>those 100 records will appear in roughly 100 different leaf nodes of
>the B+Tree. Furthermore due to the sheer size of the overall data
>those leaf nodes will tend to reside on different tracks. The
>unfortunate reality is that it isn’t possible to read these records
>without a lot of head seeking, even if the reads are ordered according
>to track position (ie elevator seeking). Now if RAID0 is used and the
>stripes are smaller that the B+Tree leaf nodes, then every drive will
>need to contribute to the reading of every leaf node. Each drive can
>read the stripes in any order it likes but it won’t avoid the fact
>that each drive performs ~100 seeks. If instead, each B+Tree leaf
>node resides in a single stripe (and therefore on a single drive) then
>with four drives in the RAID0, each drive will only need to perform
>~25 seeks.
>
You're oversimplifying. With a stripe size of 64K, it is highly unlikely
that a leaf node will span more than one stripe; therefore, it is highly
unlikely for every drive to contribute to the reading of every leaf node.
Also, you appear to be discounting concurrency, and environments where
concurrency is important such as typical OLTP environments are where
technologies such as elevator seeking are most effective.
By the way, Oracle documentation states that an 8K block size is optimal for
most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8 = 64K.
Interestingly, Sql Server uses 8K pages organized into 64K extents, which
happens to be the unit of physical storage allocation. Do you know
something they don't?
>> >> I think using a coarse stripe is counterproductive. There would be a
>> >> bigger
>> >> chance that a seek in the middle of the read would be required.
>> >> Consider:
>> >> if 3.5 stripes fit on a track in one zone of the disk, then on average
>> >> every
>> >> fourth read would require an additional seek to get the remaining half
>> >> stripe. If on the other hand, 28 stripes fit on a track, then no
>> >> additional
>> >> seeks would be necessary. Even if it were 28.5 stripes instead of 28,
>> >> one
>> >> additional seek for every 29 reads is a whole lot better than one for
>> >> every
>> >> 4.
>>
>> > Firstly, hard-disks are quite good at stepping onto the next track in
>> > the manner normally used for very large "contiguous" reads or writes.
>>
>> The best track-to-track seek time I've seen is 0.2ms for reads, 0.4ms for
>> writes. That's phenomenal but can still add up.
>
>It’s insignificant when reading or writing 1Mb at a time.
> >> Stay informed about: sequential disk read speed |
|
| Back to top |
|
 |  |
External

Since: Jan 15, 2008 Posts: 527
|
(Msg. 17) Posted: Fri Aug 29, 2008 9:14 pm
Post subject: Re: sequential disk read speed [Login to view extended thread Info.] Archived from groups: per prev. post (more info?)
|
|
|
"David BL" wrote in message
>On Aug 29, 7:47 pm, "Brian Selzer" wrote:
>> "David BL" wrote in message
>>
[snip]
> > You're oversimplifying. With a stripe size of 64K, it is highly
> > unlikely
> > that a leaf node will span more than one stripe; therefore, it is highly
> > unlikely for every drive to contribute to the reading of every leaf
> > node.
>
> I don't see how I'm oversimplifying.
>
> My point is that stripes need to be at least as coarse as the DBMS
> block size. Do you agree?
>
Yes, I think the stripe size should be a multiple of the DBMS block size.
> The choice of DBMS block size is another question entirely.
>
> > Also, you appear to be discounting concurrency, and environments where
> > concurrency is important such as typical OLTP environments are where
> > technologies such as elevator seeking are most effective.
>
> Concurrency has nothing to do with the fact that if the stripe size is
> too small the seeking of the drives won't be independent.
>
That wasn't why I brought up concurrency. You dismissed elevator seeking as
an optimization mechanism with respect to the number of seeks required by
focusing on what appeared to be a single query. Ultimately the same number
of seeks will be required for a particular query, but when combined with
ninety-nine other queries, some of those seeks can be shared with other
queries, thus reducing the total number of seeks required to satisfy the
hundred.
> > By the way, Oracle documentation states that an 8K block size is optimal
> > for
> > most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8 =
> > 64K.
> > Interestingly, Sql Server uses 8K pages organized into 64K extents,
> > which
> > happens to be the unit of physical storage allocation. Do you know
> > something they don't?
>
> Sql Server 6.5 used 2k pages and this changed to 8k pages in Sql
> Server 7.0 released in 1998. Do you expect that 64k extents are still
> optimal a decade later given that the product of transfer rate and
> seek time has been steadily increasing?
>
Sql Server 2008 still uses 8K pages and 64K extents. Also, the Oracle
documentation that cited an 8K block size as being optimal was for their
latest version, 11g.
I think that 64K extents are still optimal because the technology employed
for serializing updates is still locking, and in a concurrent environment
with an escalating locking heirachy, a page size or extent size that is too
large will cause transactions to block more often. Do you know of a sound
and practicable alternative to an escalating locking heirarchy for
serializing updates?
> 64k blocks are generally too small on modern disks. A 64k block can
> be transferred in a tenth of the time it takes to seek to it.
Why is that a problem? Isn't it more efficient if I can satisfy the same
query by reading 100 64K blocks instead of 100 1M blocks? >> Stay informed about: sequential disk read speed |
|
| Back to top |
|
 |  |
External

Since: Jan 22, 2008 Posts: 177
|
(Msg. 18) Posted: Fri Aug 29, 2008 10:43 pm
Post subject: Re: sequential disk read speed [Login to view extended thread Info.] Archived from groups: per prev. post (more info?)
|
|
|
On Aug 30, 9:14 am, "Brian Selzer" wrote:
> "David BL" wrote in message
>
>
>
> >On Aug 29, 7:47 pm, "Brian Selzer" wrote:
> >> "David BL" wrote in message
> >>
>
> [snip]
>
> > > You're oversimplifying. With a stripe size of 64K, it is highly
> > > unlikely
> > > that a leaf node will span more than one stripe; therefore, it is highly
> > > unlikely for every drive to contribute to the reading of every leaf
> > > node.
>
> > I don't see how I'm oversimplifying.
>
> > My point is that stripes need to be at least as coarse as the DBMS
> > block size. Do you agree?
>
> Yes, I think the stripe size should be a multiple of the DBMS block size.
>
> > The choice of DBMS block size is another question entirely.
>
> > > Also, you appear to be discounting concurrency, and environments where
> > > concurrency is important such as typical OLTP environments are where
> > > technologies such as elevator seeking are most effective.
>
> > Concurrency has nothing to do with the fact that if the stripe size is
> > too small the seeking of the drives won't be independent.
>
> That wasn't why I brought up concurrency. You dismissed elevator seeking as
> an optimization mechanism with respect to the number of seeks required by
> focusing on what appeared to be a single query. Ultimately the same number
> of seeks will be required for a particular query, but when combined with
> ninety-nine other queries, some of those seeks can be shared with other
> queries, thus reducing the total number of seeks required to satisfy the
> hundred.
I'm not sure how I implied that elevator seeking isn't worthwhile.
> > > By the way, Oracle documentation states that an 8K block size is optimal
> > > for
> > > most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8 =
> > > 64K.
> > > Interestingly, Sql Server uses 8K pages organized into 64K extents,
> > > which
> > > happens to be the unit of physical storage allocation. Do you know
> > > something they don't?
>
> > Sql Server 6.5 used 2k pages and this changed to 8k pages in Sql
> > Server 7.0 released in 1998. Do you expect that 64k extents are still
> > optimal a decade later given that the product of transfer rate and
> > seek time has been steadily increasing?
>
> Sql Server 2008 still uses 8K pages and 64K extents. Also, the Oracle
> documentation that cited an 8K block size as being optimal was for their
> latest version, 11g.
>
> I think that 64K extents are still optimal because the technology employed
> for serializing updates is still locking, and in a concurrent environment
> with an escalating locking heirachy, a page size or extent size that is too
> large will cause transactions to block more often. Do you know of a sound
> and practicable alternative to an escalating locking heirarchy for
> serializing updates?
I think these systems lock pages not extents, and anyway locking
granularity can in principle be orthogonal to the unit of I/O or unit
of allocation.
> > 64k blocks are generally too small on modern disks. A 64k block can
> > be transferred in a tenth of the time it takes to seek to it.
>
> Why is that a problem? Isn't it more efficient if I can satisfy the same
> query by reading 100 64K blocks instead of 100 1M blocks?
Yes, but a DBMS will often be able to satisfy a given query by reading
fewer blocks. For example full table scans are much more efficient
with 1M blocks. Also, if you increase the block size by 10x then the
height of a B+Tree will tend to smaller. For example a tree of height
3 may be able to index a trillion rather than only a billion records.
The advantages of a larger block size are more apparent in a database
storing data where there is a greater tendency for locality based on
affinity to be useful. For example, it would be rather silly to use
64k blocks to store multi-resolution terra pixel images. >> Stay informed about: sequential disk read speed |
|
| Back to top |
|
 |  |
External

Since: Jan 15, 2008 Posts: 527
|
(Msg. 19) Posted: Sat Aug 30, 2008 3:32 pm
Post subject: Re: sequential disk read speed [Login to view extended thread Info.] Archived from groups: per prev. post (more info?)
|
|
|
"David BL" wrote in message
> On Aug 30, 9:14 am, "Brian Selzer" wrote:
>> "David BL" wrote in message
>>
>>
>>
>> >On Aug 29, 7:47 pm, "Brian Selzer" wrote:
>> >> "David BL" wrote in message
>> >>
>>
>> [snip]
>>
>> > > You're oversimplifying. With a stripe size of 64K, it is highly
>> > > unlikely
>> > > that a leaf node will span more than one stripe; therefore, it is
>> > > highly
>> > > unlikely for every drive to contribute to the reading of every leaf
>> > > node.
>>
>> > I don't see how I'm oversimplifying.
>>
>> > My point is that stripes need to be at least as coarse as the DBMS
>> > block size. Do you agree?
>>
>> Yes, I think the stripe size should be a multiple of the DBMS block size.
>>
>> > The choice of DBMS block size is another question entirely.
>>
>> > > Also, you appear to be discounting concurrency, and environments
>> > > where
>> > > concurrency is important such as typical OLTP environments are where
>> > > technologies such as elevator seeking are most effective.
>>
>> > Concurrency has nothing to do with the fact that if the stripe size is
>> > too small the seeking of the drives won't be independent.
>>
>> That wasn't why I brought up concurrency. You dismissed elevator seeking
>> as
>> an optimization mechanism with respect to the number of seeks required by
>> focusing on what appeared to be a single query. Ultimately the same
>> number
>> of seeks will be required for a particular query, but when combined with
>> ninety-nine other queries, some of those seeks can be shared with other
>> queries, thus reducing the total number of seeks required to satisfy the
>> hundred.
>
> I'm not sure how I implied that elevator seeking isn't worthwhile.
>
You didn't, and I didn't say that you did. But your implication that it
doesn't affect the number of seeks, while true, is an oversimplification in
that it doesn't take into account that in a concurrent environment, many of
those seeks can be shared by other queries.
>> > > By the way, Oracle documentation states that an 8K block size is
>> > > optimal
>> > > for
>> > > most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8
>> > > =
>> > > 64K.
>> > > Interestingly, Sql Server uses 8K pages organized into 64K extents,
>> > > which
>> > > happens to be the unit of physical storage allocation. Do you know
>> > > something they don't?
>>
>> > Sql Server 6.5 used 2k pages and this changed to 8k pages in Sql
>> > Server 7.0 released in 1998. Do you expect that 64k extents are still
>> > optimal a decade later given that the product of transfer rate and
>> > seek time has been steadily increasing?
>>
>> Sql Server 2008 still uses 8K pages and 64K extents. Also, the Oracle
>> documentation that cited an 8K block size as being optimal was for their
>> latest version, 11g.
>>
>> I think that 64K extents are still optimal because the technology
>> employed
>> for serializing updates is still locking, and in a concurrent environment
>> with an escalating locking heirachy, a page size or extent size that is
>> too
>> large will cause transactions to block more often. Do you know of a
>> sound
>> and practicable alternative to an escalating locking heirarchy for
>> serializing updates?
>
> I think these systems lock pages not extents, and anyway locking
> granularity can in principle be orthogonal to the unit of I/O or unit
> of allocation.
>
I don't see how since there is clearly a correlation between the size of
each unit of I/O and the contention for what is on each unit of I/O.
>> > 64k blocks are generally too small on modern disks. A 64k block can
>> > be transferred in a tenth of the time it takes to seek to it.
>>
>> Why is that a problem? Isn't it more efficient if I can satisfy the same
>> query by reading 100 64K blocks instead of 100 1M blocks?
>
> Yes, but a DBMS will often be able to satisfy a given query by reading
> fewer blocks. For example full table scans are much more efficient
> with 1M blocks. Also, if you increase the block size by 10x then the
> height of a B+Tree will tend to smaller. For example a tree of height
> 3 may be able to index a trillion rather than only a billion records.
>
I think that if the disk subsystem is sophisticated enough, the performance
benefit of an increased block size is lost. For example, if the controller
caches entire tracks to eliminate latency, then a larger block size would
not improve performance one bit--in fact, it would tend to reduce it because
of the vastly increased volume of data that must be transferred from the
cache to RAM. In the example above, you would have to move more than 16
times as much data from the cache to RAM to answer the same query.
> The advantages of a larger block size are more apparent in a database
> storing data where there is a greater tendency for locality based on
> affinity to be useful. For example, it would be rather silly to use
> 64k blocks to store multi-resolution terra pixel images.
>
It would be equally silly to physically store multi-resolution terra pixel
images alongside scalar data that can be joined or restricted on. If
necessary, place the image heap on a separate disk subsystem with a separate
stripe size and depth, but store the scalar data on disk subsystem with a
stripe size optimal for computing joins and restrictions on it. >> Stay informed about: sequential disk read speed |
|
| Back to top |
|
 |  |
External

Since: Jan 22, 2008 Posts: 177
|
(Msg. 20) Posted: Sun Aug 31, 2008 8:09 pm
Post subject: Re: sequential disk read speed [Login to view extended thread Info.] Archived from groups: per prev. post (more info?)
|
|
|
On Aug 31, 3:32 am, "Brian Selzer" wrote:
> "David BL" wrote in message
>
>
>
>
>
>
>
> > On Aug 30, 9:14 am, "Brian Selzer" wrote:
> >> "David BL" wrote in message
>
> >>
>
> >> >On Aug 29, 7:47 pm, "Brian Selzer" wrote:
> >> >> "David BL" wrote in message
> >> >>
>
> >> [snip]
>
> >> > > You're oversimplifying. With a stripe size of 64K, it is highly
> >> > > unlikely
> >> > > that a leaf node will span more than one stripe; therefore, it is
> >> > > highly
> >> > > unlikely for every drive to contribute to the reading of every leaf
> >> > > node.
>
> >> > I don't see how I'm oversimplifying.
>
> >> > My point is that stripes need to be at least as coarse as the DBMS
> >> > block size. Do you agree?
>
> >> Yes, I think the stripe size should be a multiple of the DBMS block size.
>
> >> > The choice of DBMS block size is another question entirely.
>
> >> > > Also, you appear to be discounting concurrency, and environments
> >> > > where
> >> > > concurrency is important such as typical OLTP environments are where
> >> > > technologies such as elevator seeking are most effective.
>
> >> > Concurrency has nothing to do with the fact that if the stripe size is
> >> > too small the seeking of the drives won't be independent.
>
> >> That wasn't why I brought up concurrency. You dismissed elevator seeking
> >> as
> >> an optimization mechanism with respect to the number of seeks required by
> >> focusing on what appeared to be a single query. Ultimately the same
> >> number
> >> of seeks will be required for a particular query, but when combined with
> >> ninety-nine other queries, some of those seeks can be shared with other
> >> queries, thus reducing the total number of seeks required to satisfy the
> >> hundred.
>
> > I'm not sure how I implied that elevator seeking isn't worthwhile.
>
> You didn't, and I didn't say that you did. But your implication that it
> doesn't affect the number of seeks, while true, is an oversimplification in
> that it doesn't take into account that in a concurrent environment, many of
> those seeks can be shared by other queries.
If seeks have a reasonable probability of being shared, doesn’t that
necessarily mean the size of memory is approaching the size of
secondary storage, and a large read cache will achieve the same
effect?
I’ve heard of techniques like buffer trees that take buffering of
reads and writes to an extreme level, allowing a system to get much
closer to the theoretical optimum for I/O. However such techniques
don’t seem compatible with OLTP where transactions are using strict
2PL and need to be completed quickly in order release locks. I can
see that MVCC will provide far more opportunity for long running read
only transactions to share reads (and seeks), but I doubt whether it
would allow much sharing of reads when the size of memory is only a
tiny fraction of the total size of the database.
> >> > > By the way, Oracle documentation states that an 8K block size is
> >> > > optimal
> >> > > for
> >> > > most systems and defaults DB_FILE_MULTIBLOCK_READ_COUNT to 8. 8K * 8
> >> > > =
> >> > > 64K.
> >> > > Interestingly, Sql Server uses 8K pages organized into 64K extents,
> >> > > which
> >> > > happens to be the unit of physical storage allocation. Do you know
> >> > > something they don't?
>
> >> > Sql Server 6.5 used 2k pages and this changed to 8k pages in Sql
> >> > Server 7.0 released in 1998. Do you expect that 64k extents are still
> >> > optimal a decade later given that the product of transfer rate and
> >> > seek time has been steadily increasing?
>
> >> Sql Server 2008 still uses 8K pages and 64K extents. Also, the Oracle
> >> documentation that cited an 8K block size as being optimal was for their
> >> latest version, 11g.
>
> >> I think that 64K extents are still optimal because the technology
> >> employed
> >> for serializing updates is still locking, and in a concurrent environment
> >> with an escalating locking heirachy, a page size or extent size that is
> >> too
> >> large will cause transactions to block more often. Do you know of a
> >> sound
> >> and practicable alternative to an escalating locking heirarchy for
> >> serializing updates?
>
> > I think these systems lock pages not extents, and anyway locking
> > granularity can in principle be orthogonal to the unit of I/O or unit
> > of allocation.
>
> I don't see how since there is clearly a correlation between the size of
> each unit of I/O and the contention for what is on each unit of I/O.
Locking doesn’t even need to be page based. Are you aware of the
distinction between latches and locks?
> >> > 64k blocks are generally too small on modern disks. A 64k block can
> >> > be transferred in a tenth of the time it takes to seek to it.
>
> >> Why is that a problem? Isn't it more efficient if I can satisfy the same
> >> query by reading 100 64K blocks instead of 100 1M blocks?
>
> > Yes, but a DBMS will often be able to satisfy a given query by reading
> > fewer blocks. For example full table scans are much more efficient
> > with 1M blocks. Also, if you increase the block size by 10x then the
> > height of a B+Tree will tend to smaller. For example a tree of height
> > 3 may be able to index a trillion rather than only a billion records.
>
> I think that if the disk subsystem is sophisticated enough, the performance
> benefit of an increased block size is lost. For example, if the controller
> caches entire tracks to eliminate latency, then a larger block size would
> not improve performance one bit--in fact, it would tend to reduce it because
> of the vastly increased volume of data that must be transferred from the
> cache to RAM. In the example above, you would have to move more than 16
> times as much data from the cache to RAM to answer the same query.
I agree that if the disk subsystem caches a track then a smaller unit
of I/O can be desirable.
I think we’re talking past each other because you’re associating block
with unit of I/O whereas I’m associating it more with the unit of
allocation. As I see it, in the above you’re only comparing
relatively small and uninteresting differences in read performance for
a given allocation of data to sectors and tracks.
Changing the unit of allocation has an enormous impact on how data is
allocated by the DBMS. This in turn can have a huge impact on the
number of seeks.
With a very small unit of allocation, on a small contiguous area of
the disk there can be allocations for many unrelated tables written by
many unrelated transactions. In theory the sweet spot for the unit of
allocation occurs when the time to seek is similar in magnitude to the
time to transfer.
One can imagine more sophisticated allocators – eg that reserve large
areas of the disk in ways that promote better clustering of related
data. One can even imagine that multiple independent allocators (ie
heaps) should be used within the DBMS. However the same effect can
be achieved more simply and with less wastage of space by using a
single allocator for the entire DBMS from which reasonably large
blocks are always allocated. This can be compared to an in-memory
programming environment that employs a single heap allocator for all
threads, and wherever that would lead to high latency due to poor
localisation of very small allocations, the programmer instead
allocates a somewhat larger block and breaks it up into smaller pieces
to meet the requirements of the smaller allocations.
I cannot see any good reason why a DBMS would skimp on the unit of
allocation (eg 64k), particularly when it can be coarser than the unit
of I/O (eg 8k). I wonder whether legacy of code base or backwards
compatibility is rearing its ugly head?
> > The advantages of a larger block size are more apparent in a database
> > storing data where there is a greater tendency for locality based on
> > affinity to be useful. For example, it would be rather silly to use
> > 64k blocks to store multi-resolution terra pixel images.
>
> It would be equally silly to physically store multi-resolution terra pixel
> images alongside scalar data that can be joined or restricted on. If
> necessary, place the image heap on a separate disk subsystem with a separate
> stripe size and depth, but store the scalar data on disk subsystem with a
> stripe size optimal for computing joins and restrictions on it.
I agree that that the optimal unit of I/O depends on the nature of the
data and its usage patterns. I don’t think multiple heaps is
generally necessary. >> Stay informed about: sequential disk read speed |
|
| Back to top |
|
 |  |
External

Since: Jan 15, 2008 Posts: 527
|
(Msg. 21) Posted: Mon Sep 01, 2008 9:00 am
Post subject: Re: sequential disk read speed [Login to view extended thread Info.] Archived from groups: per prev. post (more info?)
|
|
|
"David BL" wrote in message
[snip]
> >
> > > I'm not sure how I implied that elevator seeking isn't worthwhile.
> >
> > You didn't, and I didn't say that you did. But your implication that it
> > doesn't affect the number of seeks, while true, is an oversimplification
> > in
> > that it doesn't take into account that in a concurrent environment, many
> > of
> > those seeks can be shared by other queries.
>
> If seeks have a reasonable probability of being shared, doesn’t that
> necessarily mean the size of memory is approaching the size of
> secondary storage, and a large read cache will achieve the same
> effect?
>
> I’ve heard of techniques like buffer trees that take buffering of
> reads and writes to an extreme level, allowing a system to get much
> closer to the theoretical optimum for I/O. However such techniques
> don’t seem compatible with OLTP where transactions are using strict
> 2PL and need to be completed quickly in order release locks. I can
> see that MVCC will provide far more opportunity for long running read
> only transactions to share reads (and seeks), but I doubt whether it
> would allow much sharing of reads when the size of memory is only a
> tiny fraction of the total size of the database.
>
I think you would be surprised. In a typical OLTP database, only a tiny
fraction of the total size of the database is being manipulated. Most
accesses involve what happened yesterday and what is happening today. Most
of the remainder involve what is happening this month and possibly what
happened last month. For example, in a manufacturing system, those
manufacturing orders that are currently running in the plant are accessed
most often. Those orders that ran yesterday are the next most often
queried, followed by those that have run so far this month and then last
month. Only an occasional query will look at older information--perhaps
looking at the last time a particular part was produced, for example.
A bigger block size will cause a lot of information to be transferred to
memory that isn't required to answer any queries.
[snip]
> > > I think these systems lock pages not extents, and anyway locking
> > > granularity can in principle be orthogonal to the unit of I/O or unit
> > > of allocation.
> >
> > I don't see how since there is clearly a correlation between the size of
> > each unit of I/O and the contention for what is on each unit of I/O.
>
> Locking doesn’t even need to be page based. Are you aware of the
> distinction between latches and locks?
>
Yes, I am. A latch guarantees the integrity of a row during the reading of
that row.
The overhead required for row-level locking can drastically reduce
performance. For an update that touches a few rows, row-level locking is
optimal. For an update that touches several thousand rows, locking pages
may make more sense. For an update that touches millions of rows, extent
locks or even table locks are more efficient. Another factor is the unit of
I/O for the transaction logs. If the transaction log contains the pages
that are different instead of just the rows that are different, then pages
must be locked along with any rows, because if one transaction changes one
row on a page, and another transaction changes another row on the same page,
and if one of those transactions is rolled back, then the rollback could
overwrite the change made by the transaction that will ultimately commit, or
if both roll back, the database could end up having one of the changes
recorded. It gets a lot more complicated when there are a lot of index
pages recorded in the log as well as just rows, because indexes must also be
locked, or at least invalidated until the updates propogate to them.
> > >> > 64k blocks are generally too small on modern disks. A 64k block
> > >> > can
> > >> > be transferred in a tenth of the time it takes to seek to it.
> >
> > >> Why is that a problem? Isn't it more efficient if I can satisfy the
> > >> same
> > >> query by reading 100 64K blocks instead of 100 1M blocks?
> >
> > > Yes, but a DBMS will often be able to satisfy a given query by reading
> > > fewer blocks. For example full table scans are much more efficient
> > > with 1M blocks. Also, if you increase the block size by 10x then the
> > > height of a B+Tree will tend to smaller. For example a tree of height
> > > 3 may be able to index a trillion rather than only a billion records.
> >
> > I think that if the disk subsystem is sophisticated enough, the
> > performance
> > benefit of an increased block size is lost. For example, if the
> > controller
> > caches entire tracks to eliminate latency, then a larger block size
> > would
> > not improve performance one bit--in fact, it would tend to reduce it
> > because
> > of the vastly increased volume of data that must be transferred from the
> > cache to RAM. In the example above, you would have to move more than 16
> > times as much data from the cache to RAM to answer the same query.
>
> I agree that if the disk subsystem caches a track then a smaller unit
> of I/O can be desirable.
>
> I think we’re talking past each other because you’re associating block
> with unit of I/O whereas I’m associating it more with the unit of
> allocation. As I see it, in the above you’re only comparing
> relatively small and uninteresting differences in read performance for
> a given allocation of data to sectors and tracks.
>
> Changing the unit of allocation has an enormous impact on how data is
> allocated by the DBMS. This in turn can have a huge impact on the
> number of seeks.
>
> With a very small unit of allocation, on a small contiguous area of
> the disk there can be allocations for many unrelated tables written by
> many unrelated transactions. In theory the sweet spot for the unit of
> allocation occurs when the time to seek is similar in magnitude to the
> time to transfer.
>
Interestingly, on a high-end 15k drive, revolutions take 4ms, and the
average seek time is 2.9ms. Reading an entire track, therefore, should take
at most 4ms. Based on comparisons of the specs for different size drives of
the same series, I don't think more than one head is being read at the same
time, but if they implemented that, it would be possible to read an entire
cylinder in the time it takes to read a track, 4ms. But then, it might be
cheaper to just add more drives.
> One can imagine more sophisticated allocators – eg that reserve large
> areas of the disk in ways that promote better clustering of related
> data. One can even imagine that multiple independent allocators (ie
> heaps) should be used within the DBMS. However the same effect can
> be achieved more simply and with less wastage of space by using a
> single allocator for the entire DBMS from which reasonably large
> blocks are always allocated. This can be compared to an in-memory
> programming environment that employs a single heap allocator for all
> threads, and wherever that would lead to high latency due to poor
> localisation of very small allocations, the programmer instead
> allocates a somewhat larger block and breaks it up into smaller pieces
> to meet the requirements of the smaller allocations.
>
It's a trade off: bigger allocation units would increase the throughput for
scans but would reduce the throughput for seeks.
> I cannot see any good reason why a DBMS would skimp on the unit of
> allocation (eg 64k), particularly when it can be coarser than the unit
> of I/O (eg 8k). I wonder whether legacy of code base or backwards
> compatibility is rearing its ugly head?
>
Possibly. I would think, however, that given the competitiveness between
Oracle and Microsoft, that if more speed could be achieved by increasing the
allocation unit, then one of them would have done it. They may also be
anticipating the development of lower cost solid-state drives, where seek
times and latency disappear, and there a larger block size would be a
disadvantage.
>
> > > The advantages of a larger block size are more apparent in a database
> > > storing data where there is a greater tendency for locality based on
> > > affinity to be useful. For example, it would be rather silly to use
> > > 64k blocks to store multi-resolution terra pixel images.
> >
> > It would be equally silly to physically store multi-resolution terra
> > pixel
> > images alongside scalar data that can be joined or restricted on. If
> > necessary, place the image heap on a separate disk subsystem with a
> > separate
> > stripe size and depth, but store the scalar data on disk subsystem with
> > a
> > stripe size optimal for computing joins and restrictions on it.
>
> I agree that that the optimal unit of I/O depends on the nature of the
> data and its usage patterns. I don’t think multiple heaps is
> generally necessary. >> Stay informed about: sequential disk read speed |
|
| Back to top |
|
 |  |
| Related Topics: | "Fuzzy" text search using n-grams (bigrams) -- speed? - Hi all, Let's suppose I'm writing a website where users search for movie titles, and suppose there are 200,000 movies. The site is in PHP/ MySQL. I'd like to implement a "fuzzy" text search so that similar movie titles come up in a list no ma...
How to flush data most efficiently from memory to disk whe.. - Hi, I'm looking into designing an in-memory DB and I wonder: How to flush data most efficiently when I checkpoint? Say I have a page size of 8K and 1K of those have been updated in random places, that is, the changes may be contiguous but most likely....
Problem with Nested Sets - Hello, I have a table which represents a tree of forums using nested sets. Here are the fields: id, root_id, left, right, level, label. I have a string which is like a path. For example, Forum/Sub-forum/Sub-sub-forum I want to get the id of the forum..
space filling curves - I recently started reading about various ways of making various functions of a DB faster...and I keep running into space filling curves. Unfortunately, I just can't grasp the concept. Following is what I understand so far, I'll appreciate it if someone...
can two stored procedures in same transaction cause deadlock - Hi, We are experiencing a deadlock issue using MS SQL 2000 that's generating some debate in our office. We have two stored procedures SP1 and SP2 running in the same transaction along with couple other stored procedures, SP1 does a deletion on one.. |
|
You can post new topics in this forum You can reply to topics in this forum You can edit your posts in this forum You can delete your posts in this forum You can vote in polls in this forum
|
|
|
|
 |
|
|