Rising Results in Sandra's Memory Bandwidth benchmark

Jim
K6'er Elite
Posts: 1745
Joined: Wed Jan 21, 2004 7:10 pm
Location: Toronto


Post by Jim »

Peter, I had started a new post over here and finished typing it up, but decided to send you the last part of it by PM instead. I wound up losing the post in the process of sending the PM. Careless of me, that. However, I could not sit and watch what I believe to be nonsense being posted without responding in the thread in which it appeared.

O.K. I will state it all again, perhaps more clearly this time.

1) The K6-3+ can cache all the RAM addresses a Super 7 board can throw at it, but it does not have enough cache lines to store all of the data required by the OS and Sandra while running the Memory Bandwidth test. NOTE: If Peter argues this, I want KachiWachi's opinion on the subject.

2) Where a large amount of RAM is uncached at 3rd level, a significant amount of the data required by Sandra and the OS is going to wind up residing in RAM that is uncached at 3rd level. (Though it is at least address cached at 2nd level.)

3) When the CPU requires data, it first searches the tag RAM to see whether the contents of that RAM address are stored in one of the 2nd level cache lines. If the data is not there, it then searches to see whether it is stored in one of the 3rd level cache lines. If the data is in neither cache, the processor has to go to RAM. When the processor finds the data, if it resides in RAM that is uncached at 3rd level, it writes the data into a 2nd level cache line, overwriting something else (because the 2nd level cache is always full).

4) When the processor next needs that overwritten data, it will no longer be in 2nd level cache, so the processor, having fruitlessly checked 2nd level cache, checks 3rd level cache. If the data is not there either, the processor again goes back to RAM. If this time the data resides in RAM that is cached at 3rd level, the processor writes it into one of the 3rd level cache lines, and into 2nd level too for all I know (in which case it again overwrites something else).

5) Any time the processor requires data that is in neither 2nd nor 3rd level cache, it goes to RAM, and if the data is stored in RAM that is uncached at 3rd level, the data WILL be written into 2nd level cache, overwriting something else. Conversely (and worse from the standpoint of cache optimization), if the data is found in either cache, the cache contents will not change.

6) Because the algorithms are written so that, when something has to be overwritten, the most recently accessed data is preserved in 2nd level cache (rather than the data which CANNOT be cached at 3rd level because it resides in RAM that is uncached at 3rd level), data which cannot be cached at 3rd level sometimes winds up being overwritten and is then no longer available in any cache at all. (These overwritten data points may wind up being brought into 2nd level cache and then bumped out again any number of times.)

7) Over a period of time, however, the data points which can be cached in the 3rd level cache lines get cached there, and those which cannot wind up in the 2nd level cache lines. The effects of #6 above delay this. What happens is that, one by one, on a least recently used basis, the data points which can be cached at 3rd level get overwritten in 2nd level cache. The next time the processor goes to RAM for such an item, it gets brought into 3rd level cache, leaving room in 2nd level cache for one more data point which cannot be cached at 3rd level to reside there without being overwritten by some data point that has been cached at 3rd level. i.e. The more data points that get stored in the 3rd level cache lines, the less often the processor has to go to RAM and bring a data point into 2nd level cache, overwriting something else and thereby bumping it out of cache.

8) However, the 3rd level cache only optimizes to some degree when a data point stored in RAM that is cached at 3rd level gets overwritten in 2nd level cache, thereby allowing some other data point to reside in 2nd level cache. The next time the processor needs the overwritten data point, it WILL be brought into 3rd level cache without bumping another needed data point out of 3rd level cache. Hence the process of optimization is slow.

9) Over a period of time, the cache optimizes DESPITE the algorithm (which always wants to bump the least recently used item out of 2nd level cache regardless of whether it can be cached at 3rd level or not): the data points which cannot be cached at 3rd level accumulate in 2nd level cache, and those which can be cached at 3rd level accumulate in 3rd level cache.

10) This results in progressively higher Sandra Memory Bandwidth scores, because the cache is apparently not flushed between runs.

11) That, I hope, is a clear (though not 100% accurate) explanation of the rising results in the Sandra Memory Bandwidth benchmark that are associated, and associated only, with systems having a large amount of RAM uncached at 3rd level. Were all of that RAM cached at 3rd level, the initial results would be higher, but subsequent runs would show no gain.
NOTE: Stedman did NOT have a large amount of RAM uncached at 3rd level.
Where none of the RAM is cached at 3rd level the initial results will be lower, except that apparently there is some cost to cache operation which will allow slightly higher initial results if cache is disabled; but at the cost of losing the gains associated w/ cache optimization over a period of time. (KachiWachi gets better results w/ cache disabled; but I don't think he has a large amount of RAM uncached at 3rd level, - so he has nothing to lose.)

12) The point you seem to fail to understand is that having a large amount of RAM uncached at 3rd level does not give higher results. What it does is give you lower results than you would have got if the same amount of RAM were cached at 3rd level. Only when repeated runs gradually optimize the cache contents does the result become what you would have got in the first place if all of that RAM had been cached at 3rd level.

13) As for the gain shown by other programs run after Sandra, those programs are taking advantage of Sandra's optimization of the cache at both 2nd and 3rd level. But running these other programs multiple times in succession after Sandra will not show further gains, and may well show a loss, probably brought on by cache flushing.

14) Finally, re "CHEAT BENCH", I think that is out of line. Real world applications that run for an extended period of time doing repetitive operations (and at the machine code level they most certainly do) will also show the benefit of cache optimization. It may not be readily measured, but it is there. If you want to check it, run a background application that tracks cache hits. That would be the easiest way to settle this whole exceedingly stupid argument. In any case, as long as a person posting results specifies exactly how he or she got them, it cannot be called cheating.
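A toy simulation can make the lookup order in points 3 to 9 concrete. This is only a sketch under loud assumptions: each integer address stands for one cache line, the 2nd level cache is modelled as a tiny fully associative LRU cache, the 3rd level cache as a direct-mapped cache that only accepts addresses below a hypothetical `cacheable_limit`, and, following point 5, a hit in either cache leaves the cached contents alone (only the LRU order is refreshed). `ToyHierarchy` and `run_passes` are illustrative names, not part of Sandra or any real tool, and the model counts RAM accesses rather than timing anything.

```python
from collections import OrderedDict


class ToyHierarchy:
    """Toy model of the lookup order in points 3 to 9 (hypothetical, not real hardware).

    Each integer address stands for one cache line. L2 is a fully associative
    LRU cache; L3 is direct-mapped and can only hold addresses below
    cacheable_limit.
    """

    def __init__(self, l2_lines, l3_lines, cacheable_limit):
        self.l2 = OrderedDict()              # address -> None, least recently used first
        self.l2_lines = l2_lines
        self.l3 = [None] * l3_lines          # one stored address per L3 line
        self.cacheable_limit = cacheable_limit
        self.stats = {"l2": 0, "l3": 0, "ram": 0}

    def _fill_l2(self, addr):
        self.l2[addr] = None
        self.l2.move_to_end(addr)
        if len(self.l2) > self.l2_lines:
            self.l2.popitem(last=False)      # overwrite the least recently used line

    def access(self, addr):
        # Point 3: check 2nd level cache first.
        if addr in self.l2:
            self.l2.move_to_end(addr)        # refresh LRU order; contents unchanged
            self.stats["l2"] += 1
            return "l2"
        # Then 3rd level, which can only ever hold cacheable addresses.
        slot = addr % len(self.l3)
        if addr < self.cacheable_limit and self.l3[slot] == addr:
            self.stats["l3"] += 1            # point 5: a hit leaves contents unchanged
            return "l3"
        # Miss in both: go to RAM. Fill L3 only if the address is cacheable
        # there; always fill L2 (point 4's "2nd level too" is assumed here).
        self.stats["ram"] += 1
        if addr < self.cacheable_limit:
            self.l3[slot] = addr
        self._fill_l2(addr)
        return "ram"


def run_passes(cache, working_set, passes):
    """Replay the same working set several times; return RAM accesses per pass."""
    per_pass = []
    for _ in range(passes):
        before = cache.stats["ram"]
        for addr in working_set:
            cache.access(addr)
        per_pass.append(cache.stats["ram"] - before)
    return per_pass


# Four cacheable lines (0-3) and two lines above the L3 limit (100, 101).
cache = ToyHierarchy(l2_lines=4, l3_lines=8, cacheable_limit=8)
print(run_passes(cache, [0, 1, 2, 3, 100, 101], passes=3))  # [6, 0, 0]
```

In this tiny case the model settles after a single pass. If `access` is changed so that a 3rd level hit also fills 2nd level cache, lines 100 and 101 get bumped out on every pass and each later pass needs 2 RAM accesses, which is the kind of thrashing point 6 describes. Neither variant is a claim about how the K6-3+ or any chipset actually behaves; that is what the tests proposed later in this thread are meant to measure.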
Last edited by Jim on Mon Feb 05, 2007 11:40 pm, edited 4 times in total.
Superpuppy 3
K6-3+ 450 ACZ (6x100)
DFI K6BV3+/66 Rev B2 (2 Meg) w/ 2x28mm Chipset Fans
2x256 Meg PC 133 Hynix SDRAM
1x 20G Maxtor (7200)
2x 80G Maxtor (7200) Ducted w/ 2x486 Fans Mount
52/24/52/16 LG CDR/RW/DVD
8/4/3/12/24/16/32 LG Super Multi
ATI 9000 aiw Radeon AGP
SB Audigy 1 MP3 Sound
CMD 649 IDE Controller
NEC USB 2 Card
Stedman5040
Veteran K6'er
Posts: 271
Joined: Mon Sep 05, 2005 4:22 pm

Post by Stedman5040 »

Tried out the Asus P5A (ALi Aladdin V chipset) with a K6-III+ CPU running at 5.5x100 and the same memory configuration as on the SiS530 board, so 192 MB of memory: one 128 MB stick and one 64 MB stick. With the 512k onboard cache enabled, and using the Sandra method previously described, I got about the same performance gain in the SuperPi 1M benchmark.

Set up as follows

K6-III+ @550 (5.5x100)
Asus P5A v1.4
192 MB memory @ 2-2-2-6
20 GB WD HDD
Voodoo3 3000 16 MB AGP graphics
Hercules Muse sound card
Realtek NIC

With no WPCREDIT tweaks, the SuperPi 1M time was 351 secs straight after booting into Windows ME.

I ran the Sandra 2004 memory benchmark three times and got marks of 174/171, falling to 171/169 on the third run.

After this, the SuperPi 1M time was 312 secs, a 12.5% speedup over the 351 sec baseline (351/312 = 1.125). This is close to the improvement found on the GA-5SMM board with the SiS530 chipset.
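The 12.5% figure is a speedup (old time divided by new time, minus one); expressed as a reduction in run time, the same pair of results is about 11.1%. A minimal sketch of both calculations, using only the two SuperPi 1M times from the post; the function names are illustrative:

```python
def speedup_percent(before_s, after_s):
    """How much faster the second run is: before / after - 1."""
    return (before_s / after_s - 1) * 100


def time_reduction_percent(before_s, after_s):
    """Run time saved, as a share of the original run time."""
    return (before_s - after_s) / before_s * 100


before, after = 351, 312  # SuperPi 1M times from the post, in seconds
print(f"speedup: {speedup_percent(before, after):.1f}%")            # speedup: 12.5%
print(f"time saved: {time_reduction_percent(before, after):.1f}%")  # time saved: 11.1%
```

Either figure is fine to post, as long as it is clear which one is meant.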

When I get the chance, I will try the maximum memory of 3x256 MB modules and rerun the test. I will also look at the VIA chipset on my EP-MVP3G2 board.

Regards,

Stedman.
DonPedro
K6'er Elite
Posts: 578
Joined: Wed Jul 27, 2005 2:11 pm

Post by DonPedro »

jim and everybody,

I am just stopping by for now to briefly say a hundred times thank you for moving the debate here and starting a new thread!

very appreciated!
:)

did not sleep last night, so I am too tired right now to concentrate on "hard"-posting ... ;)
KachiWachi
K6'er Elite
Posts: 507
Joined: Wed Sep 21, 2005 10:53 am
Location: Pennsylvania, USA

Post by KachiWachi »

@ 11) -> "KachiWachi gets better results w/ cache disabled; but I don't think he has a large amount of RAM uncached at 3rd level, - so he has nothing to lose."

Don't forget that my readings were taken in DOS (6.22 to be exact), so there is no system overhead (Windows stuff) going on...and the test program is small.

The i430VX will cache up to 64 MB of RAM...so with the full complement of 128 MB, 50% of the RAM is uncached at the third level.
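The same arithmetic applies to any board whose 3rd level (board) cache only covers the bottom part of memory. A small sketch, assuming the cacheable limit is a single fixed figure as described for the i430VX; `uncached_ram` is a hypothetical helper, and the 768 MB case is an example of a maxed-out board, not a measurement from this thread:

```python
def uncached_ram(installed_mb, cacheable_mb):
    """Return (uncached MB, uncached fraction) when the board's L3 cache can
    only cover the first cacheable_mb of installed RAM."""
    uncached_mb = max(installed_mb - cacheable_mb, 0)
    return uncached_mb, uncached_mb / installed_mb


print(uncached_ram(128, 64))   # (64, 0.5): the i430VX case above
print(uncached_ram(768, 64))   # 768 MB on a board limited to 64 MB cacheable
```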

My DFI happens to have an on-board 512 KB cache (Ultra Low Power, MCache)...though other i430VX-based boards *may* use a 256 KB (or a 512 KB...if available) COASt Module (PBSRAM).

Useful DOS tools:

CacheChk
CacheMem
CacheGraf
Moderator - Wim's BIOS

PC #1 - DFI 586IPVG, K6-2/+ 450 (Cyrix MII 433), 128 MB EDO. BIOS patched by Jan Steunebrink.
PC #2 - Amptron PM-7900 (M520), i200 non-MMX, 128 MB EDO
PC #3 - HP8766C, PIII-667, 768 MB SDRAM
PC #4 - ASUS P3V4X, PIII-733, 256 MB SDRAM
PC #5 - Gateway 700X, P4-2.0 GHz, 768 MB PC800 RDRAM
PC #6 - COMPAQ Evo N1020v laptop, P4-2.4 GHz, 1 GB PC2700 DDR
PC #7 - Dell Dimension 4600i, P4-2.8 GHz, 512 MB PC2700 DDR
PC #8 - Acer EeePC netbook, Atom N270 @ 1.60 GHz, 1 GB RAM
PC #9 - ??? ;)
Jim

Post by Jim »

@ KachiWachi : I don't know how DOS handles memory loading. I know Windopes loads into uncached RAM first, leaving cached RAM for user programs to speed up their operation. I think that is why other benchmarks run after Sandra has been used to optimize the cache show a gain. Because Windopes loads into uncached RAM, Sandra optimizes the data points the OS requires into 2nd level cache (because they can't be cached at 3rd level). That leaves the new program at least part of the RAM that is cached at 3rd level to load into. The fact that the data points Windopes requires are already optimized in 2nd level cache, combined with the fact that 3rd level cached RAM is left available for the bench program, allows the bench program to achieve a high level of cache optimization, leading to better results within just one run.

As regards what you get from your machines, I think that if you were to run Sandra's Memory Bandwidth benchmark repeatedly on both machines, first w/ the cache disabled and then w/ the cache enabled, you would get lower results on both machines w/ cache enabled (first because of the inherent small loss entailed in running w/ cache enabled, and second because neither one has a LARGE enough amount of RAM uncached at 3rd level to overcome that loss). Nonetheless, the machine w/ the larger amount of RAM uncached at 3rd level would show a smaller loss than the other machine after several runs of the Sandra benchmark had optimized the 3rd level cache.

@ Stedman : Thanks, Tony!! Very kind of you to offer to help settle this mess. Which board you use is not critical, as long as it supports a fair amount of RAM uncached at 3rd level. Ideally, what you should do is:

First, set the board up w/ no more RAM than the 3rd level cache can handle. Then, with that setup and whatever tweaks you use to achieve best results (please specify them), run a set of tests at each of 5.5x100, 5.5x105, 6x100 and, if possible, 6x105. These sets of tests should be run twice: once w/ 3rd level cache enabled and once w/ 3rd level cache disabled.

Each set of tests should consist of running some benchmark program to determine its normal result, then running the Sandra Memory Bandwidth benchmark at least 3 times (and, if the results are rising, until they stop rising). Once that has been done, rerun the first benchmark (consistently using the same one in all test sets) to determine whether, and to what extent, that benchmark has gained from Sandra preoptimizing the cache for it.

Having completed those tests, reconfigure the machine with a maxed RAM installation, 512 Meg or 768 Meg depending on the mobo used. Then, using the same tweaks as before, run all the same test sets again, i.e. 5.5x100, 5.5x105, 6x100 and 6x105, both w/ and without cache enabled.
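The full test plan is RAM configuration x multiplier/FSB setting x 3rd level cache on/off. A sketch that only enumerates the sets so none get skipped; `build_test_plan` is a hypothetical helper, and the 64 MB and 768 MB sizes are example values (the post asks for no more RAM than the 3rd level cache can handle, then a maxed installation of 512 or 768 Meg):

```python
from itertools import product

# Clock settings from the test plan; 6x105 is "if possible" in the post.
SETTINGS = [(5.5, 100), (5.5, 105), (6.0, 100), (6.0, 105)]


def build_test_plan(ram_configs_mb, settings=SETTINGS):
    """List every test set: RAM size x multiplier/FSB x L3 cache on/off."""
    sets = []
    for ram_mb, (mult, fsb), l3_on in product(ram_configs_mb, settings, (True, False)):
        sets.append({
            "ram_mb": ram_mb,
            "setting": f"{mult:g}x{fsb}",
            "cpu_mhz": mult * fsb,
            "l3_cache": "enabled" if l3_on else "disabled",
        })
    return sets


# Hypothetical RAM sizes: 64 MB (all cacheable at 3rd level) and a maxed 768 MB.
plan = build_test_plan([64, 768])
print(len(plan))  # 16
for test_set in plan[:2]:
    print(test_set)
```

Each entry is one "set" in the sense of the post: benchmark, repeated Sandra runs, benchmark again.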

Note: You will find that the more RAM is uncached at 3rd level, the more Sandra Memory Bandwidth benchmark runs it will take before the Sandra results peak. The peak is, I believe, when the cache contents have been optimized as well as they can be. The run after the peak will show falling results. So what you have to do is determine how many Sandra runs it takes for results to peak given the amount of RAM installed. (The number of runs will be consistent for any given amount of RAM.) Once you know that, you can run your test sets knowing when is the best time to cut off the Sandra runs and run the other benchmark.

Note also: Because the run following the peak invariably starts showing declining results if you keep running Sandra, it may be best to cut off the Sandra runs 1 run before the peak once you know how many runs it takes to peak. That way the benchmark following the Sandra runs may show an even higher result than if you run Sandra until it peaks, which leaves the other benchmark in the "falling results due" position.
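The cut-off rule in the two notes above can be written down as a small function. This is a sketch: "peak" is taken to mean the first run that the following run does not beat, the function names are hypothetical, and the scores in the example are made up for illustration, not results from any machine.

```python
def peak_run(scores):
    """1-based number of the first Sandra run that the next run does not beat.

    If the scores are still rising at the end, the last run is returned,
    which only means that no peak has been observed yet.
    """
    for i in range(len(scores) - 1):
        if scores[i + 1] <= scores[i]:
            return i + 1
    return len(scores)


def sandra_runs_before_benchmark(scores, stop_one_early=True):
    """How many Sandra runs to do before the other benchmark."""
    peak = peak_run(scores)
    return max(peak - 1, 1) if stop_one_early else peak


scores = [150, 158, 163, 165, 161]           # hypothetical scores, not measured
print(peak_run(scores))                      # 4
print(sandra_runs_before_benchmark(scores))  # 3
```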

Note also: If you have Norton Utilities 2000, its "System Doctor" applet has a cache hit tracker which can be enabled to run in the background during the tests, and you can watch what % of cache hits you are getting during each run.

All in all, it's a heck of a lot of work, and ideally it should be done on each of the various chipsets separately; but that is the right way to do it, to settle a number of issues. 1) Does Sandra dislike the 5.5 multiplier? (I got 241 IntMMX on my P5A-B @ 5.5x112 w/ 768 Meg of RAM installed; beat that, Peter.) 2) Are rising results in the Sandra Memory Bandwidth benchmark associated ONLY w/ large amounts of RAM uncached at 3rd level? 3) Does the benefit accrued to other benchmarks by preoptimizing the cache w/ Sandra remain consistent despite variations in the amount of RAM installed and in whether or not 3rd level cache is enabled? 4) Is it in fact a rising incidence of cache hits, achieved through optimization of cache content, that leads to the rising results from Sandra's Memory Bandwidth benchmark and from other benchmark programs run after Sandra has preoptimized cache content? 5) Does Peter (who is a nice guy) need to take a laxative on this one?

Finally, if that is so much work that you decide to take a pass on it; no hard feelings. I can't do it right now, because both my good machines are still hors de combat. But when SP3 is fully back up, if nobody else has done it by then, I WILL do it; because I do not like to see absolute nonsense being accepted as truth.
Last edited by Jim on Tue Feb 06, 2007 8:10 am, edited 1 time in total.
Stedman5040

Post by Stedman5040 »

Jim

I will start with the GA-5SMM board, as it is already set up and caused all this in the first place. It should be fun, as its 512k onboard L3 cache will only cover up to 64 MB of DRAM addresses. What a pile of pants! Anyway, I can also run an ordinary K6-2 on it to see what difference having no on-die L2 cache makes when the memory is loaded up to the maximum. With over 700 MB of uncached memory, it should be interesting to see the result.
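Why a 512k board cache stops at 64 MB comes down to tag RAM width: in a direct-mapped external cache, each line can hold any one of 2^t blocks that share its index, where t is the number of tag bits, so the cacheable area is cache size * 2^t. A sketch, assuming that simple direct-mapped relation; the GA-5SMM's actual tag SRAM width is not given in this thread, and 7 bits is only what the 64 MB figure implies (64 MB / 512 KB = 128 = 2^7). The function names are illustrative.

```python
KB = 1024
MB = 1024 * KB


def cacheable_area(cache_bytes, tag_bits):
    """RAM range a direct-mapped external cache can cover: each line can
    hold any one of 2**tag_bits blocks that share its index."""
    return cache_bytes * 2 ** tag_bits


def tag_bits_needed(cache_bytes, ram_bytes):
    """Smallest tag width that would let cache_bytes of cache cover ram_bytes."""
    bits = 0
    while cacheable_area(cache_bytes, bits) < ram_bytes:
        bits += 1
    return bits


print(cacheable_area(512 * KB, 7) // MB)     # 64
print(tag_bits_needed(512 * KB, 768 * MB))   # 11
```

So, on this model, covering a full 768 MB with the same 512k cache would need 11 tag bits rather than 7.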

Stedman
KachiWachi

Post by KachiWachi »

Another interesting tidbit of information to ponder...

Since Windows loads from the top down, in my system no third level caching happens until the top 64 MB of RAM has been filled with whatever. Once past that point, the cache becomes functional.
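KachiWachi's point can be illustrated with a toy top-down placement. This is a hypothetical sketch: it treats what Windows loads as a few contiguous blocks placed from the top of RAM downward, which is a big simplification of real (paged) memory management, and the block sizes are invented. It uses the 128 MB installed / 64 MB cacheable figures from the i430VX post above; `top_down_layout` is an illustrative name.

```python
def top_down_layout(installed_mb, cacheable_mb, block_sizes_mb):
    """Place blocks from the top of RAM downward and label each one by
    whether it lies in the L3-cacheable region (below cacheable_mb)."""
    top = installed_mb
    layout = []
    for size in block_sizes_mb:
        start = top - size
        if start < 0:
            raise MemoryError("blocks do not fit in installed RAM")
        if top <= cacheable_mb:
            status = "cached"            # entirely below the limit
        elif start >= cacheable_mb:
            status = "uncached"          # entirely above the limit
        else:
            status = "partly cached"     # straddles the limit
        layout.append((start, top, status))
        top = start                      # the next block goes just below
    return layout


# 128 MB installed, 64 MB cacheable; hypothetical block sizes in MB.
for start, end, status in top_down_layout(128, 64, [40, 30, 20]):
    print(f"{start:3d}-{end:3d} MB  {status}")
```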

On the other hand, RAM access is faster than HDD access (swap file), so the machine will still run faster overall...even with uncached RAM in place.
Jim

Post by Jim »

@ Stedman : Thanks again, Tony. One other point. I said that the number of runs of the Sandra Memory Bandwidth benchmark required to achieve peak results with any given amount of RAM installed will remain consistent.

I perhaps should have said that the number of runs of the Sandra Memory Bandwidth benchmark required to achieve peak results with any given amount of RAM installed on any given board will remain consistent, i.e. change boards, and the number of runs required to achieve peak results with a given amount of RAM may change too.

@ KachiWachi : True! And for that reason you may find that your machine would run a bit faster with more RAM installed. Incidentally, I have a brand new (unused, still sitting in its original box) Amptron PM8700 that you are welcome to if you want it.
Jim

Post by Jim »

@ Stedman :
Something I forgot to mention: there should be a system restart after each set of runs (i.e. set parameters, restart, run benchmark program x, run Sandra's Memory Bandwidth benchmark n times, rerun benchmark program x = 1 set) to ensure that one set of runs does not influence the results of the next. Also, if using a cache hit tracker, be advised that the tracker's operation will skew the results of the other tests and may require more Sandra runs to optimize the cache contents. Test results achieved with a cache hit tracker running can only be compared with other tests run with a tracker running, not with results achieved w/ no tracker running.

Hope these last minute additions did not come too late.
KachiWachi

Post by KachiWachi »

Jim sez -

"I know Windopes loads into uncached RAM first, leaving cached RAM for user programs to speed up their operation."

This is only true if you have uncached RAM on the system.

If all of the RAM is cached, Windows will still load from the top downward...so anything that happens to be present in RAM at the time will have the ability to be cached...as required.


Also...I wouldn't necessarily use the word optimized. I would say that all of the information (data/instructions) required to perform the operation is cached. Optimization would refer to having the cache contents ordered for the most efficient usage...which I don't think happens.
Jim

Post by Jim »

@ KachiWachi : Your first point is obvious. If there is no uncached RAM in the system obviously Windopes MUST load into cached RAM. On the second point, I think you are wrong. I think that is exactly what does happen; over a period of time the cache contents are progressively reordered in the direction of the most efficient possible operation. Normally they never achieve full optimization, because normally the data points used are constantly changing. But when the same set of data points is being recycled continually, then a very high level of optimization can be achieved.
lazy_kalabok

Post by lazy_kalabok »

maybe one interesting fact for you guys:

tried out a Cyrix CPU on an Aladdin 7 board (64 KB L1 only) with 256 MB SDRAM.
sandra gave constant results every time, as did superpi/everest before and after.
for a K6+ CPU it's nearly the same; sometimes superpi even needed more time to complete.
KachiWachi

Post by KachiWachi »

@second point -

This will be determined by how the External Cache is mapped to RAM.

If it is Direct Mapped, not possible.

If it is Associative (in one form or another)...possible.

Go re-read the section in the PCGuide that concerns this.

For me, the i430VX External Cache strategy is Direct Mapped, so WYSIWYG.

On the other hand, the CPU uses a unified, four-way set-associative cache (see the 23535.pdf for more information on how the CPU works).
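The difference between the two mappings can be shown with index arithmetic. A sketch, assuming 32-byte lines for both caches (an assumption, not stated in the thread), a 512 KB direct-mapped board cache, and the K6-3+'s 256 KB four-way set-associative L2: five blocks spaced 512 KB apart all compete for the same slot. In the direct-mapped cache only one of them can be resident at a time, so there is nothing to reorder; in the four-way cache the same set has room for four. The helper names are illustrative.

```python
KB = 1024
LINE_BYTES = 32  # assumed cache line size for both caches


def direct_mapped_line(addr, cache_bytes, line_bytes=LINE_BYTES):
    """The one and only line a direct-mapped cache can use for addr."""
    return (addr // line_bytes) % (cache_bytes // line_bytes)


def set_index(addr, cache_bytes, ways, line_bytes=LINE_BYTES):
    """The set an N-way set-associative cache uses for addr; the block may
    occupy any of the `ways` lines in that set."""
    num_sets = cache_bytes // (line_bytes * ways)
    return (addr // line_bytes) % num_sets


def max_resident(addrs, index_of, ways):
    """How many of addrs can be cached at the same time."""
    per_slot = {}
    for a in addrs:
        slot = index_of(a)
        per_slot[slot] = per_slot.get(slot, 0) + 1
    return sum(min(count, ways) for count in per_slot.values())


blocks = [k * 512 * KB for k in range(5)]  # five blocks 512 KB apart
board = max_resident(blocks, lambda a: direct_mapped_line(a, 512 * KB), ways=1)
cpu_l2 = max_resident(blocks, lambda a: set_index(a, 256 * KB, ways=4), ways=4)
print(board, cpu_l2)  # 1 4
```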

You might find it interesting to run your tests with the CPU L2 disabled (EFER bit 4 = 1)...just to see what happens.

You may also wish to investigate the performance differences with Write Allocation on and off, and with EWBE set to each of the three available settings (All, SEWBED, GEWBED).

Thanks.
Jim

Post by Jim »

@ KachiWachi : You are right about Direct Mapping; but Optimization means best possible. What you are talking about is not possible. I should have read what you said more carefully.

@ Kalabok : I gather the Cyrix has no 2nd level cache. How much cache did the mobo you used have? Are we talking K6-2+ (128k 2nd level cache) or K6-3+ (256k 2nd level cache)?
DonPedro

Post by DonPedro »

sorry that my time to really take part here is very limited at the moment. answering what I think is needed to clarify all of this would take more time than just stopping by, reading and possibly dropping a note. I hope I will be able to concentrate on this very soon.....

@kachiwachi, jim

the aladdin7 has no mainboard-cache at all.

@all

just a very "short" note regarding "optimized" cache.

program A, which is completely unrelated to program B, can not under any circumstances "optimize" the cache (be it 3rd or 2nd level) for program B.

let us just recall how any program works.
once A is finished (and exited), program B starts and allocates memory, which it gets from the supervisor, the OS. then B starts to work and uses the memory block it got control over. since B does its very own job, completely independent of any other program, it certainly does not read from that memory block to make use of the data found there, which has meaningless "random" values from B's perspective and is of absolutely no use to it. if that were the case, the caches would already be in a favourable state for B: main memory, 3rd level and 2nd level cache would already contain the data that B needs, and the tag RAMs would be perfectly set up. B could work on the data without any calculations and could finish in zero time. I hope we agree that this could not be and is not the case.

so even if the OS allocated memory for B at exactly those addresses which program A had used just before, and B "could" "take advantage" of the tag RAM's "perfect" address-line setup, ALL data B is going to use is "DIRTY" from B's AND THE TAG RAM'S perspective!

data stored in:
- main memory,
- 3rd and
- 2nd level cache memory.

they all contain the wrong data. this means there is no (and can be no) positive caching effect from READING DATA there, because B does not read from there at this early stage in its process. and I suppose we agree that THE caching effect happens when data is READ from memory addresses that are already kept as a copy in 3rd or 2nd level cache: it happens when it is possible to read from cache instead of memory. the time the tag RAM needs to do its job (keeping track of the addresses used) is comparatively negligible here and is spent either way.

so what B does next is common programmer's practice: it resets the memory to some predefined state so it can safely work on it. in most cases it zeroes the values, which is WRITING to the memory (no caching effect here, except a quite small one in the case of a write-back cache, but that effect applies under all circumstances and we do not deal with it here). and WRITING puts the tag RAMs in "dirty mode". B does this only once, at the beginning, and the only benefit could come in the special case where the memory addresses it got from the OS are exactly the ones A used before, so the tag RAMs are already set up perfectly (address-tracking-wise). so it might win a few split seconds here, ONLY ONCE, because of that special case.

if B does not reset the memory, because the programmer decided otherwise and provides other means to make sure the program does not read memory whose contents are random and therefore useless junk, then it certainly starts WRITING to this memory BEFORE it reads from there at a later point, and can THEN PROFIT FROM CACHING.

so program B alone, regardless of what program ran before, is "responsible" for whether it can profit from data stored in any cache, and it can only benefit from caching (READING) once it has written its own data to the memory. the tag RAM(s) of both 2nd and 3rd level will make sure, by their nature, that it always happens that way.

there is no way that A can, after the fact, make the caches anything other than dirty from B's and the tag RAM's perspective, hence no pre-caching aka "pre-optimizing" (TM Jim) is possible.

well, I just wanted to drop a short message .... ;)

@jim

re "CHEAT BENCH":

- there is no program that does not run at machine code level. I think you got something very wrong here.

- ANY program will profit from an optimally working cache. whether one will profit the way you construct "optimized cache", I doubt.

- do you understand the meaning of putting a word or phrase in quotation marks? and what the heck could it mean if these words are additionally followed by a " :) "
like for example " ....some go even that far that they openly admit "cheating". you are such a baaad trickster, stedman! :) "
I wonder why stedman did not complain, but you feel pressed to cry foul.