Saturday, 29 August 2009

THE ARM Processor

RISC
vs
CISC



In the early days of computing, you had a lump of silicon which performed a number of instructions. As time progressed, more and more facilities were required, so more and more instructions were added. However, according to the 20-80 rule, 20% of the available instructions are likely to be used 80% of the time, with some instructions only used very rarely. Some of these instructions are very complex, so creating them in silicon is a very arduous task. Instead, the processor designer uses microcode. To illustrate this, we shall consider a modern CISC processor (such as a Pentium or 68000 series processor). The core, the base level, is a fast RISC processor. On top of that is an interpreter which 'sees' the CISC instructions, and breaks them down into simpler RISC instructions.

Already, we can see a pretty clear picture emerging. Why, if the processor is a simple RISC unit at heart, don't we use that directly? Well, the answer lies more in politics than design. Acorn, however, saw this and, not being constrained by the need to remain totally compatible with earlier technologies, decided to implement their own RISC processor.

Up until now, we've not really considered the real differences between RISC and CISC, so...

A Complex Instruction Set Computer (CISC) provides a large and powerful range of instructions, which is correspondingly less flexible (and more arduous) to implement in silicon. For example, the 8086 microprocessor family has these instructions:

JA Jump if Above
JAE Jump if Above or Equal
JB Jump if Below
...
JPO Jump if Parity Odd
JS Jump if Sign
JZ Jump if Zero

There are 32 jump instructions in the 8086, and the 80386 adds more. I've not read a spec sheet for the Pentium-class processors, but I suspect it (and MMX) would give me a heart attack!
By contrast, the Reduced Instruction Set Computer (RISC) concept is to identify the simple sub-components of those instructions and provide only those. As these are much simpler, they can be implemented directly in silicon, so they run at the maximum possible speed. Nothing is 'translated'. There are only two Jump instructions in the ARM processor - Branch and Branch with Link. The "if equal, if carry set, if zero" type of selection is handled by condition options, so for example:

BLNV Branch with Link NeVer (useful!)
BLEQ Branch with Link if EQual

and so on. The BL part is the instruction, and the following part is the condition. This is made more powerful by the fact that conditional execution can be applied to most instructions! This has the benefit that you can test something, then only do the next few commands if the criteria of the test matched. No branching off, you simply add conditional flags to the instructions you require to be conditional:
SWI "OS_DoSomethingOrOther" ; call the SWI
MVNVS R0, #0 ; If failed, set R0 to -1
MOVVC R0, #0 ; Else set R0 to 0

Or, for the 80486:
INT $...whatever... ; call the interrupt
CMP AX, 0 ; did it return zero?
JE failed ; if so, it failed, jump to fail code
MOV DX, 0 ; else set DX to 0
return
RET ; and return
failed
MOV DX, 0FFFFH ; failed - set DX to -1
JMP return

The odd flow in that example is designed to allow the fastest non-branching throughput in the 'did not fail' case. This is at the expense of two branches in the 'failed' case.
I am not, however, an x86 coder, so that can possibly be optimised - mail me if you have any suggestions...


Most modern CISC processors, such as the Pentium, use a fast RISC core with an interpreter sitting between the core and the instruction stream. So when you are running Windows95 on a PC, it is not that much different to trying to get W95 running on a software PC emulator. Just imagine the power hidden inside the Pentium...

Another benefit of RISC is that it contains a large number of registers, most of which can be used as general purpose registers.

This is not to say that CISC processors cannot have a large number of registers - some do. However, for its intended use, a typical RISC processor requires more registers to give it additional flexibility. Gone are the days when you had two general purpose registers and an 'accumulator'.

One thing RISC does offer, though, is register independence. As you have seen above, the ARM register set defines at minimum R15 as the program counter and R14 as the link register (although, after saving the contents of R14, you can use this register as you wish). R0 to R13 can be used in any way you choose, although the Operating System defines R13 as the stack pointer. You can, if you don't require a stack, use R13 for your own purposes. APCS applies firmer rules and assigns more functions to registers (such as the Stack Limit). However, none of these - with the exception of R15, and sometimes R14 - is a constraint applied by the processor. You do not need to worry about juggling everything through a single accumulator in long instruction sequences; you simply make good use of the available registers.
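For instance (a trivial sketch - the choice of R8 is arbitrary, and 'somewhere' is just an imaginary routine):

MOV R8, R14       ; nothing magic about R14's contents - park the return address in R8
BL  somewhere     ; now we are free to use BL, which corrupts R14
MOV PC, R8        ; and return to our own caller via the saved copy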

The 8086 offers you fourteen registers, but with caveats:
The first four (A, B, C, and D) are Data registers (a.k.a. scratch-pad registers). They are 16bit and accessed as two 8bit registers, thus register A is really AH (A, high-order byte) and AL (A low-order byte). These can be used as general purpose registers, but they can also have dedicated functions - Accumulator, Base, Count, and Data.
The next four registers are Segment registers for Code, Data, Extra, and Stack.
Then come the five Offset registers: Instruction Pointer (PC), SP and BP for the stack, then SI and DI for indexing data.
Finally, the flags register holds the processor state.
As you can see, most of the registers are tied up with the bizarre memory addressing scheme used by the 8086. So only four general purpose registers are available, and even they are not as flexible as ARM registers.

The ARM processor differs again in that it has a reduced number of instruction classes (Data Processing, Branching, Multiplying, Data Transfer, Software Interrupts).
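A minimal illustration, one instruction from each class (the label 'somewhere' is imaginary; "OS_WriteC" is simply a handy SWI to name):

ADD R0, R1, R2    ; data processing
B   somewhere     ; branching
MUL R3, R1, R2    ; multiplying
LDR R0, [R1]      ; data transfer
SWI "OS_WriteC"   ; software interrupt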

A final example of minimal registers is the 6502 processor, which offers you:
Accumulator - for results of arithmetic instructions
X register - First general purpose register
Y register - Second general purpose register
PC - Program Counter
SP - Stack Pointer, offset into page one (at &01xx).
PSR - Processor Status Register - the flags.
While it might seem like utter madness to only have two general purpose registers, the 6502 was a very popular processor in the '80s. Many famous computers have been built around it.
For the Europeans: consider the Acorn BBC Micro, Master, Electron...
For the Americans: consider the Apple II and the Commodore PET.
The ORIC uses a 6502, and the C64 uses a variant of the 6502.
(in case you were wondering, the Speccy uses the other popular processor - the ever bizarre and freaky Z80)

So if entire systems could be created with a 6502, imagine the flexibility of the ARM processor.
It has been said that the 6502 is the bridge between CISC design and RISC. Acorn chose the 6502 for their original machines, such as the Atom and the System series. They went from there to design their own processor - the ARM.



To summarise the above, the advantages of a RISC processor are:

Quicker time-to-market. A smaller processor will have fewer instructions, and the design will be less complicated, so it may be produced more rapidly.


Smaller 'die size' - the RISC processor requires fewer transistors than comparable CISC processors...
This in turn leads to a smaller silicon size (I once asked Russell King of ARMLinux fame where the StrongARM processor was - and I was looking right at it, it is that small!)
...which, in turn again, leads to less heat dissipation. Most of the heat of my ARM710 is actually generated by the 80486 in the slot beside it (and that's when it is supposed to be in 'standby').


Related to all of the above, it is a much lower power chip. ARM designs its processors in static form, so that the processor clock can be stopped completely rather than simply slowed down. The Solo computer (designed for use in third world countries) is a system that will run from a 12V battery, charging from a solar panel.


Internally, a RISC processor has a number of hardwired instructions.
This was also true of the early CISC processors, but these days a typical CISC processor has a heart which executes microcode instructions which correlate to the instructions passed into the processor. Ironically, this 'heart' tends to be RISC. :-)


As touched on by Matthias below, a RISC processor's simplicity does not necessarily refer to a simple instruction set.
He quotes LDREQ R0,[R1,R2,LSR #16]!, though I would prefer to quote the 26 bit instruction LDMEQFD R13!, {R0,R2-R4,PC}^ which restores R0, R2, R3, R4, and R15 from the fully descending stack pointed to by R13. The stack is adjusted accordingly. The '^' pushes the processor flags into R15 as well as the return address. And it is conditionally executed. This allows a tidy 'exit from routine' to be performed in a single instruction.
Powerful, isn't it?
The RISC concept, however, does not state that all the instructions are simple. If that were true, the ARM would not have a MUL, as you can do exactly the same thing with a loop of ADDs. No, the RISC concept means the silicon is simple. It is a simple processor to implement.
I'll leave it as an exercise for the reader to figure out the power of Matthias' example instruction. It is exactly on par with my example, if not slightly more so!
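Just to illustrate the point about MUL, here is a sketch of multiplying with nothing but a loop of ADDs - not how you would really write it, the register choices are arbitrary, and R1 is assumed to be non-negative:

MOV   R2, #0         ; clear the result accumulator
.mulloop
SUBS  R1, R1, #1     ; one less addition to do; sets the flags
ADDPL R2, R2, R0     ; if the count hasn't gone negative, add R0 into the result
BPL   mulloop        ; and go around again
                     ; R2 now holds R0 multiplied by the original R1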


RISC vs ARM
You shouldn't call it "RISC vs CISC" but "ARM vs CISC". For example, conditional execution of (almost) any instruction isn't a typical feature of RISC processors, but can only(?) be found on ARMs. Furthermore, there are quite a few people claiming that an ARM isn't really a RISC processor, as it doesn't provide only a simple instruction set; i.e. you'll hardly find any CISC processor which provides a single instruction as powerful as a
LDREQ R0,[R1,R2,LSR #16]!

Today it is wrong to claim that CISC processors execute their complex instructions more slowly; modern processors can execute most complex instructions in one cycle. They may need very long pipelines to do so (up to 25 stages or so with a Pentium III), but nonetheless they can. And complex instructions provide a big potential for optimisation, i.e. if you have an instruction which took 10 cycles on the old model and get the new model to execute it in 5 cycles, you end up with a speed increase of 100% (without a higher clock frequency). On the other hand, ARM processors executed most instructions in a single cycle right from the start, and thus don't have this optimisation potential (except for the MUL instruction).
The argument that RISC processors provide more registers than CISC processors isn't right either. Just take a look at the (good old) 68000 - it has about the same number of registers as the ARM. And the fact that 80x86 compatible processors don't provide more registers is just a matter of compatibility (I guess). But this argument isn't completely wrong: RISC processors are much simpler than CISC processors and thus take up much less space, which leaves room for additional functionality like more registers. On the other hand, a RISC processor with only three or so registers would be a pain to program, i.e. RISC processors simply need more registers than CISC processors for the same job.

And the argument that RISC processors have pipelining whereas CISCs don't is plainly wrong - e.g. the ARM2 didn't have it, whereas the Pentium does...

Today, the advantages of RISC over CISC are these:

RISC processors are much simpler to build, which again results in the following advantages:
easier to build, i.e. you can use already existing production facilities
much less expensive - just compare the price of an XScale with that of a Pentium III at 1 GHz...
less power consumption, which again gives two advantages:
much longer use of battery driven devices
no need for cooling of the device, which again gives two advantages:
smaller design of the whole device
no noise



RISC processors are much simpler to program, which doesn't only help the assembler programmer but the compiler designer, too. You'll hardly find any compiler which uses all the functions of a Pentium III optimally...
And then there are the benefits of the ARM processors:

Conditional execution of most instructions, which is a very powerful thing, especially with large pipelines, as the whole pipeline has to be refilled every time a branch is taken - that's why CISC processors put such a huge effort into branch prediction


The shifting of registers while other instructions are executed, which means that shifts take up no time at all (the 68000 took one cycle per bit to shift); for example, see the sketch below
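A trivial sketch of what that 'free' barrel shifter buys you (register choices are arbitrary):

ADD R0, R1, R1, LSL #2    ; R0 = R1 + (R1 << 2), i.e. R1 * 5, in a single instruction
RSB R0, R1, R1, LSL #4    ; R0 = (R1 << 4) - R1, i.e. R1 * 15 - the shift costs nothing extra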


The optional setting of flags, i.e. ADD versus ADDS, which becomes extremely powerful together with the conditional execution of instructions; for example, see the sketch below
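A small sketch of the two working together (the label is imaginary): comparing a 64 bit value held in R0 (high word) and R1 (low word) against one in R2 and R3, without a single branch in the comparison itself:

CMP   R0, R2            ; compare the high words, setting the flags
CMPEQ R1, R3            ; only if those were equal, compare the low words
BHI   first_is_bigger   ; branch on the combined (unsigned) result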


The free use of offsets when accessing memory, i.e.
LDR R0,[R1,#16]
LDR R0,[R1,#16]!
LDR R0,[R1],#16
LDR R0,[R1,R2]
LDR R0,[R1,R2]!
LDR R0,[R1],R2
...
The 68000 could only increase the address register by the size of the data read (i.e. by 1, 2 or 4). Just imagine how much better an ARM processor can be programmed to draw (not only) a vertical line on the screen.
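A sketch of the vertical line idea, assuming an 8 bits per pixel screen mode with R0 = colour byte, R1 = screen address of the top pixel, R2 = bytes per screen line and R3 = line height (all hypothetical register assignments):

.vline
STRB R0, [R1], R2    ; plot the pixel, then step the address down one whole line
SUBS R3, R3, #1      ; one less pixel to plot
BNE  vline           ; loop until the line is complete
MOV  PC, R14         ; return to the caller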


The (almost) free use of all registers with all instructions (which may well be an advantage of any RISC processor). It simply is great to be able to use
ADD PC,PC,R0,LSL #2
MOV R0,R0
B R0is0
B R0is1
B R0is2
B R0is3
...
or even
ADD PC,PC,R0,LSL #3
MOV R0,R0
MOV R1,#1
B Continue
MOV R2,#2
B Continue
MOV R2,#4
B Continue
MOV R2,#8
B Continue
...
I used this technique even more extensively when programming my C64 emulator, to emulate the 6510. There the shift is 8, which gives 256 bytes for each instruction to emulate. Within those 256 bytes there is not only the code for the emulation of the instruction, but also the code to react to interrupts, the fetching of the next instruction, and the jump to the emulation code of that instruction; i.e. the code to emulate CLC (clear C flag) looks like this:
ADD R10,R10,#1 ; increment PC of 6510 to point to next
; instruction
BIC R6,R6,#1 ; clear C flag of 6510 status register
LDR R0,[R12,#64] ; read 6510 interrupt state
CMP R0,#0 ; interrupt occurred?
BNE &00018040 ; yes -> jump to interrupt handler
LDRB R1,[R4,#1]! ; read next instruction
ADD PC,R5,R1,LSL #8 ; jump to emulation code
MOV R0,R0 ; lots of these to fill up the 256 bytes
This means that there is only one single jump for each instruction emulated. By this (and a bit more) the emulator is able to reach 76% of the speed of the original C64 with an A3000, 116% with an A4000, 300% with an A5000 and 3441% with my RiscPC (SA at 287 MHz). The code may look hard to handle, but the source of it looks much better:
;-----------;
; $18 - CLC ;
;-----------;
ADD R10,R10,#1 ; increment PC of 6510
BIC R6,R6,#000001 ; clear C flag of 6510 status register
FNNextCommand ; do next command
FNFillFree ; fill remaining space





THE PIPELINE
A conventional processor executes instructions one at a time, just as you expect it to when you write your code. Each execution can be broken down into three parts, which anybody who has learned this stuff at college will have burned into their memory: fetch, decode, execute.

In English...

Fetch
Retrieve the instruction from memory.
Don't get all techie - whether the instruction comes from system memory or the processor cache is irrelevant, the instruction is not loaded 'into' the processor until it is specifically requested. The cache simply serves to speed things up. By loading chunks of system memory into the cache, the processor can satisfy many more of its instruction fetches by pulling instructions from the cache. This is necessary because processors are very fast (StrongARMs, 200MHz+; Pentiums up to GHz!) and system memory is not (33, 66, or 133MHz). To see the effect the cache has on your processor, use *Cache Off.


Decode
Figure out what the instruction is, and what is supposed to be done.


Execute
Perform the requested operation.
Each of these operations is performed in step with the electronic 'heartbeat' of the system - the clock. Example clock rates for several microprocessors used in Acorn products:

BBC microcomputer 6502 2MHz
Acorn A310-A3000 ARM 2 8MHz
Acorn A5000 ARM 3 25MHz
Acorn A5000/I ARM 3 30MHz
RiscPC600 ARM610 33MHz
RiscPC700 ARM710 40MHz
Early PC co-processor 486SXL-40 33MHz (not 40!)
RiscPC (StrongARM) SA110 202MHz - 278MHz+

As shown in the PC world, processors are reaching GHz speeds (1,000,000,000 ticks/sec), which necessitates much in the way of speed tweaks (huge amounts of cache, an extremely optimised pipeline) because there is no way the rest of the system can keep up. Indeed, the rest of the system is likely to be operating at a quarter of the speed of the processor. The RiscPC is designed to work, I believe, at 33MHz. That is why people thought the StrongARM wouldn't give much of a speed boost. However the small size of ARM programs, coupled with a rather large cache, made the StrongARM a viable proposition in the RiscPC; it bottlenecked horribly, but other factors meant that this wasn't so visible to the end user, and the result was a system much faster than the ARM710. More recently there is the Kinetic StrongARM processor card, which attempts to alleviate the bottlenecks by installing a big wodge of memory directly on the processor card and using that. It even goes so far as to install the entirety of RISC OS into that memory, so you aren't kept waiting for the ROMs (which are slower even than RAM).
There is an obvious solution. Since these three stages (fetch, decode, execute) are fairly independent, would it not be possible to:

fetch instruction #3
decode instruction #2
execute instruction #1

...then, on the next clock tick...

fetch instruction #4
decode instruction #3
execute instruction #2

...tick...

fetch instruction #5
decode instruction #4
execute instruction #3

...and so on. Once the pipeline is full, an instruction completes on every clock tick, even though each individual instruction still takes three ticks to pass through the processor. This is how the ARM 2 and ARM 3 work - a simple three stage pipeline (the StrongARM, as we shall see, stretches this to five stages).

Processor Types




ARM 1 (v1)
This was the very first ARM processor. Actually, when it was first manufactured in April 1985, it was the very first commercial RISC processor. Ever.
As a testament to the design team, it was "working silicon" in its first incarnation, it exceeded its design goals, and it used less than 25,000 transistors.
The ARM 1 was used in a few evaluation systems on the BBC micro (Brazil - BBC interfaced ARM), and a PC machine (Springboard - PC interfaced ARM).
It is believed a large proportion of Arthur was developed on the Brazil hardware.
In essence, it is very similar to an ARM 2 - the differences being that R8 and R9 are not banked in IRQ mode, there's no multiply instruction, no LDR/STR with register-specified shifts, and no co-processor gubbins.



ARM evaluation system for BBC Master
(original picture source not known - downloaded from a website full of BBC-related images
this version created by Rick Murray to include zoomed-up ARM down the bottom...)




ARM 2 (v2)
Experience with the ARM 1 suggested improvements that could be made. Such additions as the MUL and MLA instructions allowed for real-time digital signal processing. Back then, it was to aid in generating sounds. Who could have predicted exactly how well suited to DSP the ARM would be, some fifteen years later?
In 1985, Acorn hit hard times which led to it being taken over by Olivetti. It took two years from the arrival of the ARM to the launch of a computer based upon it...
...those were the days my friend, we thought they'd never end.
When the first ARM-based machines rolled out, Acorn could gladly announce to the world that they offered the fastest RISC processor around. Indeed, the ARM processor kicked ass across the computing league tables, and for a long time was right up there in the 'fastest processors' listings. But Acorn faced numerous challenges. The computer market was in disarray, with some people backing IBM's PC, some the Amiga, and all sorts of little itty-bitty things. Then Acorn went and launched a machine offering Arthur (which was about as nice as the first release of Windows), which had no user base, precious little software, and not much third party support. But they succeeded.

The ARM 2 processor was the first to be used within the RISC OS platform, in the A305, A310, and A4x0 range, and it went on to be used in all of the early machines, including the A3000. It is clocked at 8MHz, which translates to approximately four and a half million instructions per second (0.56 MIPS/MHz).


No current image - can you help?




ARM 3 (v2as)
Launched in 1989, this processor built on the ARM 2 by offering 4K of cache memory and the SWP instruction. The desktop computers based upon it were launched in 1990.
Internally, via the dedicated co-processor interface, CP15 was 'created' to provide processor control and identification.
Several speeds of ARM 3 were produced. The A540 runs a 26MHz version, and the A4 laptop runs a 24MHz version. By far the most common is the 25MHz version used in the A5000, though those with the 'alpha variant' have a 33MHz version.
At 25MHz, with 12MHz memory (a la A5000), you can expect around 14 MIPS (0.56 MIPS/MHz).
It is interesting to note that the ARM3 doesn't 'perform' faster - both the ARM2 and the ARM3 average 0.56 MIPS/MHz. The speed boost comes from the higher clock speed, and the cache.
Oh, and just to correct a common misunderstanding, the A4 is not a squashed down version of the A5000. The A4 actually came first, and some of the design choices were reflected in the later A5000 design.


ARM3 with FPU
(original picture downloaded from Arcade BBS, archive had no attribution)




ARM 250 (v2as)
The 'Electron' of ARM processors, this is basically a second level revision of the ARM 3 design which removes the cache, and combines the primary chipset (VIDC, IOC, and MEMC) into the one piece of silicon, making the creation of a cheap'n'cheerful RISC OS computer a simple thing indeed. This was clocked at 12MHz (the same as the main memory), and offers approximately 7 MIPS (0.58 MIPS/MHz).
This processor isn't as terrible as it might seem. That the A30x0 range was built with the ARM250 was probably more a cost-cutting exercise than the original intention. The ARM250 was designed for low power consumption and low cost, both important factors in devices such as portables, PDAs, and organisers - several of which were developed and, sadly, none of which actually made it to release.
No current image - can you help?




ARM 250 mezzanine
This is not actually a processor. It is included here for historical interest. It seems the machines that would use the ARM250 were ready before the processor, so early releases of the machine contained a 'mezzanine' board which held the ARM 2, IOC, MEMC, and VIDC.



ARM 4 and ARM 5
These processors do not exist.
More and more people began to be interested in the RISC concept as, at around the same time, common Intel (and clone) processors showed a definite trend towards higher power consumption and a greater need for heat dissipation, neither of which is friendly to devices that are supposed to be running off batteries.
The ARM design was seen by several important players as being the epitome of sleek, powerful RISC design.
It was at this time that a deal was struck between Acorn, VLSI (long-time manufacturers of the ARM chipset), and Apple. This led to the death of the Acorn RISC Microprocessor, as Advanced RISC Machines Ltd was born. This new company was committed to design and support specifically for the processor, without the hassle and baggage of RISC OS (the main operating system for the processor and the desktop machines). Both of those would be left to Acorn.

In the change from being a part of Acorn to being ARM Ltd in its own right, the whole numbering scheme for the processors was altered.




ARM 610 (v3)
This processor brought with it two important 'firsts'. The first 'first' was full 32 bit addressing, and the second 'first' was the opening for a new generation of ARM based hardware.
Acorn responded by making the RiscPC. In the past, critics were none-too-keen on the idea of slot-in cards for things like processors and memory (as used in the A540), and by this time many people were getting extremely annoyed with the inherent memory limitations of the older hardware: the MEMC can only address 4Mb of memory, and you can add more only by daisy-chaining MEMCs - an idea that not only sounds hairy, it is hairy!
The RiscPC brought back the slot-in processor with a vengeance. Future 'better' processors were promised, and a second slot was provided for alien processors such as the 80486 to be plugged in. As for memory, two SIMM slots were provided, and the memory was expandable to 256Mb. This may not sound like much now that modern PCs come with half that as standard, however you can get a lot of mileage from a RiscPC fitted with a puny 16Mb of RAM.
But, always, we come back to 32 bit operation. It has been with us, and known about, ever since the first RiscPC rolled out, but few people noticed, or cared. Now, as the new generation of ARM processors drops the 26 bit 'emulation' modes, we RISC OS users are faced with the option of getting ourselves sorted, or dying.
Ironically, the other mainstream operating systems for the RiscPC hardware - namely ARMLinux and netbsd/arm32 - are already fully 32 bit.

Several speeds were produced: 20MHz, 30MHz, and the 33MHz part used in the RiscPC.
The ARM610 processor features an on-board MMU to handle memory and a 4K cache, and it can even switch itself from little-endian operation to big-endian operation. The 33MHz version offers around 28 MIPS (0.84 MIPS/MHz).


The RiscPC ARM610 processor card
(original picture by Rick Murray, © 2002)




ARM 710 (v3)
As an enhancement of the ARM610, the ARM 710 offers an increased cache size (8K rather than 4K), clock frequency increased to 40MHz, improved write buffer and larger TLB in the MMU.
Additionally, it supports CMOS/TTL inputs, Fastbus, and 3.3V power but these features are not used in the RiscPC.
Clocked at 40MHz, it offers about 36 MIPS (0.9 MIPS/MHz), which combined with the higher clock speed means it runs an appreciable amount faster than the ARM 610.

ARM710 side by side with an 80486, the coin is a British 10 pence coin.
(original picture by Rick Murray, © 2001)




ARM 7500
The ARM7500 is a RISC based single-chip computer with memory and I/O control on-chip to minimise external components. The ARM7500 can drive LCD panels/VDUs if required, and it features power management. The video controller can output up to a 120MHz pixel rate, 32bit sound, and there are four A/D convertors on-chip for connection of joysticks etc.
The processor core is basically an ARM710 with a smaller (4K) cache.
The video core is a VIDC2.
The IO core is based upon the IOMD.
The memory/clock system is very flexible, designed for maximum use with minimum fuss. Setting up a system based upon the ARM7500 should be fairly simple.



ARM 7500FE
A version of the ARM 7500 with hardware floating point support.


ARM7500FE, as used in the Bush Internet box.

(original picture by Rick Murray, © 2002)



StrongARM / SA110 (v4)
The StrongARM took the RiscPC from around 40MHz to 200-300MHz and showed a speed boost that was more than the hardware should have been able to support. Still severely bottlenecked by the memory and I/O, the StrongARM made the RiscPC fly. The processor was the first ARM to feature separate instruction and data caches, and this caused quite a lot of self-modifying code to fail - including, amusingly, Acorn's own runtime compression system. But on the whole, the incompatibilities were not more painful than an OS upgrade (anybody remember the RISC OS 2 to RISC OS 3 upgrade, and how all the programs that used SYS "OS_UpdateMEMC",64,64 for a speed boost froze the machine solid?).
In instruction terms, the StrongARM offers half-word loads and stores, and signed half-word and byte loads and stores. Also provided are instructions for multiplying two 32 bit values (signed or unsigned) and returning a 64 bit result. These are documented in the ARM assembler user guide as only working in 32 bit mode; however, experimentation will show you that they work in 26 bit mode as well, and later documentation confirms this.
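As a rough illustration of those additions (register choices are arbitrary; this is only a sketch, not lifted from any particular program):

UMULL R0, R1, R2, R3   ; unsigned long multiply: R1 (high) : R0 (low) = R2 * R3
SMULL R0, R1, R2, R3   ; the same, treating R2 and R3 as signed values
LDRH  R4, [R5]         ; load an unsigned half-word (16 bits)
LDRSH R4, [R5]         ; load a signed half-word
LDRSB R4, [R5]         ; load a signed byte
STRH  R4, [R5]         ; store a half-word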
The cache has been split into separate instruction and data cache (Harvard architecture), with both of these caches being 16K, and the pipeline is now five stages instead of three.
In terms of performance... at 100MHz, it offers 114MIPS which doubles to 228MIPS at 200MHz (1.14 MIPS/MHz).

A StrongARM mounted on a LART board.
In order to squeeze the maximum from a RiscPC, the Kinetic includes fast RAM on the processor card itself, as well as a version of RISC OS that installs itself on the card. Apparently it flies due to removing the memory bottleneck, though this does cause 'issues' with DMA expansion cards.



A Kinetic processor card.



SA1100 variant
This is a version of the SA110 designed primarily for portable applications. I mention it here as I am reliably informed that the SA1100 is the processor inside the 'faster' Panasonic satellite digibox. It contains the StrongARM core, MMU, cache, PCMCIA, general I/O controller (including two serial ports), and a colour/greyscale LCD controller. It runs at 133MHz or 200MHz and it consumes less than half a watt of power.





Thumb
The Thumb instruction set is a reworking of the ARM set, with a few things omitted. Thumb instructions are 16 bits wide (instead of the usual 32 bits). This allows for greater code density in places where memory is restricted. The Thumb set can only address the first eight registers, and there is no conditional execution. Also, Thumb cannot do a number of things required for low-level processor exceptions, so the Thumb instruction set will always come alongside the full ARM instruction set. Exceptions and the like can be handled in ARM code, with Thumb used for the more regular code.





Other versions
These versions are afforded less coverage due, mainly, to my not owning nor having access to any of these versions.
While my site started as a way to learn to program the ARM under RISC OS, the future is in embedded devices using these new systems, rather than the old 26 bit mode required by RISC OS...
...and so, these processors are something I would like to detail, in time.

M variants
This is an extension of the version three design (ARM 6 and ARM 7) that provides the extended 64 bit multiply instructions.
These instructions became a main part of the instruction set in the ARM version 4 (StrongARM, etc).



T variants
These processors include the Thumb instruction set (and, hence, no 26 bit mode).



E variants
These processors include a number of additional instructions which provide improved performance in typical DSP applications - the 'E' standing for "Enhanced DSP".





The future
The future is here. Newer ARM processors exist, but they are 32 bit devices.
This means, basically, that RISC OS won't run on them until all of RISC OS is modified to be 32 bit safe. As long as BASIC is patched, a reasonable software base will exist. However all C programs will need to be recompiled. All relocatable modules will need to be altered. And pretty much all assembler code will need to be repaired. In cases where source isn't available (ie, anything written by Computer Concepts), it will be a tedious slog.
It is truly one of the situations that could make or break the platform.
I feel, as long as a basic C compiler/linker is made FREELY available, then we should go for it. It need not be a 'good' compiler, as long as it will be a drop-in replacement for Norcroft CC version 4 or 5. Why this? Because RISC OS depends upon enthusiasts to create software, instead of big corporations. And without inexpensive reasonable tools, they might decide it is too much to bother with converting their software, so may decide to leave RISC OS and code for another platform.

I, personally, would happily download a freebie compiler/linker and convert much of my own code. It isn't plain sailing for us - think of all of the library code that needs to be checked. It will be difficult enough to obtain a 32 bit machine to check the code works correctly, never mind all the other pitfalls. Asking us for a grand to support the platform is only going to turn us away in droves. Heck, I'm still using ARM 2 and ARM 3 systems. Some of us smaller coders won't be able to afford such a radical upgrade. And that will be VERY BAD for the platform. Look how many people use the FREE user-created Internet suite in preference to commercial alternatives. Look at all of the support code available on Arcade BBS. Much of that will probably go, yes. But would a platform trying to re-establish itself really want to say goodbye to the rest?
I don't claim my code is wonderful, but if only one person besides myself makes good use of it - then it has been worth it.

The Stack



The 6502 microprocessor features support for a stack, located at &01xx in memory and extending for 256 bytes. It also features addressing modes which operate more quickly on page zero (&00xx).

Both of these are inflexible, and not in keeping with the RISC concept.
The ARM processor provides instructions for manipulating a stack (LDM and STM). The actual location where your stack lays its hat is entirely up to you and the rules of good programming.

For example:

MOV R13, #&8000
STMFD R13!, {R0-R12, R14}

would work, but is likely to scribble your registers over something important. So typically you would set R13 to the end of your workspace, and stack backwards from there.
These are conventions used in RISC OS. You can replace R13 with any register except R14 (if you need it) and R15. As R14 and R15 have a defined purpose, the next register down is R13, so that is used as the stack pointer.
Likewise, in RISC OS, the stacks are fully descending (FD), which means the stack grows downwards in memory and the stack pointer points to the last item pushed.
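To make 'full descending' concrete, here is a minimal sketch - suppose R13 currently holds &9000, an arbitrary example address at the top of your own workspace:

STMFD R13!, {R0, R1}   ; R13 becomes &8FF8; R0 is stored at &8FF8, R1 at &8FFC
LDMFD R13!, {R0, R1}   ; the values come back, and R13 returns to &9000

After the push, R13 points at the last word pushed (that's the 'full' part), and the stack has moved downwards through memory (that's the 'descending' part).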

You can, quite easily, shirk convention and stack using whatever register you like (R0-R13 and R14 if you don't need it) and also you can set up any kind of stack you like, growing up, growing down, pointer to next free or last used... But be aware that when RISC OS provides you with stack information (if you are writing a module, APCS assembler, BASIC assembler, or being a transient utility, for example) it will pass the address in R13 and expect you to be using a fully descending stack. So while you can use whatever type of stack/location that suits you, it is suggested you follow the OS style. It makes life easier.

If you are not sure what a stack is, exactly, then consider it a temporary dumping area. When you start your program, you will want to put R14 somewhere so you know where to branch to in order to exit. Likewise, every time you BL, you will want to put R14 someplace if you plan to call another BL.
To make this clearer:

; ...entry, R14 points to exit location

BL one
BL two
MOV PC, R14 ; exit

.one
; R14 points to instruction after 'BL one'
...do stuff...
MOV PC, R14 ; return

.two
; R14 points to instruction after 'BL two'
...do stuff...
BL three
MOV PC, R14 ; return

.three
; R14 points to instruction after 'BL three'
B four
; no return

.four
; Not a BL, so R14 unchanged
MOV PC, R14 ; returns from .three because R14 not changed.

Take a moment to work through that code. It is fairly simple. And it is fairly obvious that something needs to be done with R14, otherwise you won't be able to exit. Now, a viable answer is to shift R14 into some other register. So now consider that the "...do stuff..." parts use ALL of the remaining registers.
Now what? Well, what we need is a controlled way to dump R14 into memory until we come to need it.
That's what a stack is.
That code again:

; ...entry, R14 points to exit location, we assume R13 is set up

STMFD R13!, {R14}
BL one
BL two
LDMFD R13!, {PC} ; exit

.one
; R14 points to instruction after 'BL one'
STMFD R13!, {R14}
...do stuff...
LDMFD R13!, {PC} ; return

.two
; R14 points to instruction after 'BL two'
STMFD R13!, {R14}
...do stuff...
BL three
LDMFD R13!, {PC} ; return

.three
; R14 points to instruction after 'BL three'
B four
; no return

.four
; Not a BL, so R14 unchanged
LDMFD R13!, {PC} ; returns from .three because R14 not changed.

A quick note, you can write:
STMFD R13!, {R14}
...do stuff...
LDMFD R13!, {R14}
MOV PC, R14
but STM/LDM do NOT keep track of which stored values belong to which registers, so you can store R14 and reload it directly into PC, thus disposing of the need to do a MOV afterwards.
The caveat is that the registers are always transferred in ascending order...

STMFD R13!, {R7, R0, R2, R1, R9, R3, R14}
will save R0, R1, R2, R3, R7, R9, and R14 (in that order), no matter what order you list them in. So code like:
STMFD R13!, {R0, R1}
LDMFD R13!, {R1, R0}
will NOT swap the two values over - the register list is treated as a set, so R0 is reloaded from the word R0 was saved to, and likewise R1. If you really want to exchange two registers, you have to do it explicitly.


Memory Management




Introduction
The RISC OS machines work with two different types of memory - logical and physical.
The logical memory is the memory as seen by the OS, and the programmer. Your application begins at &8000 and continues until &xxxxx.
The physical memory is the actual memory in the machine.
Under RISC OS, memory is broken into pages. Older machines have a page size of 8/16/32K (depending on installed memory), and newer machines have a fixed 4K page. If you were to examine the pages in your application workspace, you would most likely see that the pages are seemingly random, not in order. The pages relate to physical memory, combined to provide you with xxxx bytes of logical memory. The memory controller is constantly shuffling memory around so that each task that comes into operation 'believes' it is loaded at &8000. Write a little application to count how many wimp polls occur every second, and you'll begin to appreciate how much is going on in the background.




MEMC : Older systems
In ARM 2, ARM 250, and ARM 3 machines, the memory is controlled by the MEMC (MEMory Controller). This unit can cope with an address space of 64Mb, but in reality can only access 4Mb of physical memory. The 64Mb space is split into three sections:
0Mb - 32Mb : Logical RAM
32Mb - 48Mb : Physical RAM
48Mb - 64Mb : System ROMs and I/O

Parts of the system ROMs and I/O are mapped over each other, so reading from it gives you code from ROM, and writing to it updates things like the VIDC (video/sound).
It is possible to fit up to 16Mb of memory to an older machine, but you will need a matched MEMC for each 4Mb. People have reported that simply fitting two MEMCs (to give 8Mb) is either hairy or unreliable, or both. In practice, the hardware to do this properly only really existed for the A540 machine, where each 4Mb was a slot-in memory card with an on-board MEMC. Other solutions for, say, the A5000 and the A410, are elaborate bodges. Look at http://www.castle.org.uk/castle/upg25.htm for an example of what is required to fit 8Mb into an A5000!

The MEMC is capable of restricting access to pages of memory in certain ways, either complete access, no access, no access in USR mode, or read-only access. Older versions of RISC OS only implemented this loosely, so you need to be in SVC mode to access hardware directly but you could quite easily trample over memory used by other applications.




MMU : Newer systems
The newer systems, with ARM6 or later processor, have an MMU built into the processor. This consists of the translation look-aside buffer (TLB), access control logic, and translation table walk logic. The MMU supports memory accesses based upon 1Mb sections or 4K pages. The MMU also provides support for up to 16 'domains', areas of memory with specific access rights.
The TLB caches 64 translated entries. If the TLB holds an entry for the requested virtual address, the access control logic determines whether access is permitted. If it is, the MMU outputs the appropriate physical address; otherwise it signals the processor to abort.
If the TLB misses (it doesn't contain an entry for the virtual address), the walk logic will retrieve the translation information from the (full) translation table in physical memory.
If the MMU is disabled, the virtual address is output directly as the physical address.
It gets a lot more complicated, suffice to say that more access rights are possible and you can specify memory to be bufferable and/or cacheable (or not), and the page size is fixed to 4K. A normal RiscPC offers two banks of RAM, and is capable of addressing up to 256Mb of RAM in fairly standard PC-style SIMMs, plus up to 2Mb of VRAM double-ported with the VIDC, plus hardware/ROM addressing.

On the RiscPC, the maximum address space of an application is 28Mb. This is not a restriction of the MMU, but a restriction of the 26 bit processor mode used by RISC OS. A 32 bit processor mode could, in theory, allocate the entire 256Mb to a single task.
All current versions of RISC OS are 26-bit.




System limitations
Consider a RiscPC with an ARM610 processor.
The cache is 4K.
The bus speed is 16MHz (note, only slightly faster than the A5000!), and the hardware does not support burst-mode for memory accesses.
Upon a context switch (ie, making an application 'active') you need to remap its memory to begin at &8000 and flush the cache.
I'll leave you to do the maths. :-)


Memory schemes
and
multitasking




Introduction
This is a reference, designed to help you understand the various types of memory handling and multitasking that exist.


Memory is a resource that needs careful management. It is expensive (£/Mb is much higher for memory than for conventional harddisc storage). A good system will offer flexible facilities trading off speed for functionality.
You need memory because it is fast. It is rarely as fast as the processor these days, but it is much faster than a harddisc. We need fast, and we need big, so we can hold the large programs and large amounts of data that seem to be around. It boggles the mind that a commercial mainframe once did the accounts and stuff with a mere 4K of memory.

Typically, there will be three or four, possibly five, kinds of storage in the computer.

Level 1 cache
This is inside the processor, usually operating at the core speed of the processor. It is between 4K and 32K usually.


Level 2 cache
If the difference between the processor speed and system memory is quite large, you will often have a level 2 cache. This is mounted on the motherboard, and typically runs at a speed roughly halfway between the processor speed and the speed of the system memory.
It is usually between 64K and 512K. RISC OS machines do not have Level 2 cache.


Level 3 cache
If your processor is running at some silly speed (such as 1GHz) and your system memory is running at a tenth of that, you might like a chunk (say a Mb or two) of cache between level 2 and system memory, so that you can further improve speed.
Each layer of cache is getting slower, until we reach...


System memory
Your DRAM, SRAM, SIMMs, DIMMs, or whatever you have fitted. Speeds range from 2MHz in the old home computers, to around 133MHz in a typical PC compatible. Older PCs use 33MHz or 66MHz buses.
The ARM2/250 machines have an 8MHz bus, the ARM3 machines (A5000,...) have a 12MHz bus, the RiscPC has a 16MHz bus. In these cases, only the ARM2 is clocked at the same speed as the bus. The ARM3 is clocked at 25 or 30MHz, the ARM610 at 33MHz, the ARM710 at 40MHz and the StrongARM at a variety of speeds up to 280-ish MHz.


Harddisc
Slow, huge, cheap.



Basic monoprogramming
This is where all of the memory is just available, and you run one application at a time. The kernel/OS/BIOS (whatever) sits in one place, either in RAM or ROM and it is mapped into the address map.
Consider:

.----------------. .----------------.
| OS in ROM | | Device drivers |
| | | in ROM |
|----------------| |----------------|
| | | |
| Your | | Your |
| application | | application |
| | |----------------|
|----------------| | |
|System workspace| | OS in RAM |
'----------------' '----------------'

The first example is similar to the layout of the BBC microcomputer. The second is not that different to a basic MS-DOS system, the OS is loaded low in memory, the BIOS is mapped in at the top, and the application sits in the middle.
To be honest, the first example is used a lot under RISC OS as well. It is exactly what a standard application is supposed to believe. The OS uses page zero (&0000 - &7FFF) for internal housekeeping, it (your app) begins at &8000, and the hardware/OS sit way up in the ether at &3800000.
Memory management under RISC OS is more complex, but this is how a typical application will see things.

When the memory is organised in this way, only one application can be running. When the user enters a command, if it is an application then that application is copied from disc into memory, then it is executed. When the application is done with, the operating system reappears, waiting for you to give it something else to do.




Basic multiprogramming
Here, we are running several applications. While they are not running concurrently (to do so would be impossible, a processor can only do one thing at a time), the amount of time given to an application is tiny, so the system is spending a lot of time faffing around hopping from one application to the next, all giving you the illusion that n applications are all happily running together on your computer.
Memory is typically handled as non-contiguous blocks. On an ARM machine, pages are brought together to fake a chunk of memory beginning at &8000. Anybody who has tried an address translation in their allocated memory will know two things. Firstly, it is near impossible to get an actual physical memory address out of the OS. Secondly, the physical pages that make up your (logically contiguous) memory are scattered all over the place, in no particular order.
The following program demonstrates this:

END = &10000 : REM Constrain slot to 32K

DIM willow% 16
SYS "Wimp_SlotSize", -1, -1 TO slot%
SYS "OS_ReadMemMapInfo" TO page%

PRINT "Using "+STR$(slot% / page%)+" pages, each page being "+STR$(page%)+" bytes."
PRINT "Pages used: ";

more% = slot% / page%
FOR loop% = 0 TO (more% - 1)
willow%!0 = 0
willow%!4 = &8000 + (loop% * page%)
willow%!8 = 0
willow%!12= -1
SYS "OS_FindMemMapEntries", willow%
IF loop% > 0 THEN PRINT ", ";
PRINT STR$(willow%!0);
NEXT
PRINT
END

This outputs something similar to:
Using 8 pages, each page being 4096 bytes.
Pages used: 2555, 2340, 2683, 2682, 2681, 2680, 2679, 2678



RISC OS keeps everything loaded in memory. Applications are then 'paged in' by remapping the memory pointers in the page tables; consequently, other tasks are mapped out.

Windows/Unix systems load applications into memory, supported by a system called 'virtual memory' which dumps unused pages to disc in order to free system memory for applications that need it. I am not sure how Windows organises its memory - whether it does it in a style similar to RISC OS (ie, remap to start from a specific address) or whether each application is just told 'you are here'.
Virtual memory is useful, as you can fit a 32Mb program into 16Mb of memory if you are careful how you load it, and swap out old parts for new parts as necessary.

Some systems use a lazy-paging form of memory. In this case, only the first page of memory is filled by the application when execution starts. As more of the application is executed, the operating system fills in the parts as required.
By contrast, under RISC OS an application has to be loaded in its entirety before it can run. Consider loading, well, practically anything off of floppy disc. It takes time.




Virtual memory
When you no longer have enough actual physical memory, you may have virtual memory: a set of memory locations that don't exist, but that the operating system tries real hard to convince you do. And in the centre of the ring is the MMU (Memory Management Unit - inspired name, no?), keeping control.
[note: you need an MMU anyway when your memory is broken into remappable pages, this just seemed like a good time to introduce it!]
When the processor is instructed to jump to &8000 to begin executing an application, it passes the address &8000 to the MMU. This translates the address into the correct real address and outputs this on the address lines, say &12FC00. The processor is not aware of this, the application is not aware of this, the computer user is not aware of this.

So we can take this one stage further by mapping onwards into memory that does not exist at all. In this case, the MMU will hiccup and say "Oi! You! No!" and the operating system will be called in a panic (correctly known as a "page fault"). The operating system will be calm and collected and think, "Ah, virtual memory". A little-used page of real memory will be shoved out to disc, then the page that the MMU was trying to find will be loaded in place of the page we just got rid of. The memory map will be updated accordingly, then control will be handed back to the user application at the exact point the page fault occurred. The application, unaware of all of this palaver, will perform that instruction again; only this time the MMU will (happily?) output the correct address to the memory system, and all will continue.




Page tables and the MMU
The page table exists to map each page into an address. This allows the operating system to keep track of which memory is pretending to be which. However it is more complex. Some pages cannot be remapped, some pages are doubly mapped, some are not to be touched in user mode code, some aren't to be touched at all. Some are read only. Some just don't exist. All of this must be kept track of.
So the MMU takes an address, looks it up in the page table, and spits out the correct address.

Let's do some maths. We'll assume a 4K page size (a la RISC OS on a RiscPC). A 32 bit address space holds a million 4K pages. With one million pages, you'll need one million entries. In the ARM MMU, each entry takes 7 words, so we are looking at seven million words - 28Mb - just to index our memory.
It gets better. Every single memory reference will be passed through the MMU. So we'll want it to operate in nanoseconds. Faster, if possible.
In reality, it is somewhat easier, as most typical machines don't have enough memory to fill the entire addressing space; indeed many are unlikely to get close, for technical reasons (the RiscPC can have 258Mb maximum RAM, or 514Mb with the Kinetic - the extra 2Mb is the VRAM). Even so, the page tables will get large.

So there are three options:

Have a huge array of fast registers in the MMU. Costly. Very.
Hold the page tables in main memory. Slow. Very.
Compromise. Cache the active pages in the MMU, and store the rest on disc.
An example. A RiscPC: 64Mb of RAM, 2Mb of VRAM, 4Mb of ROM plus the hardware I/O (double mapped). That's 73,400,320 bytes, or 17920 pages. It would take 71680 bytes just to store an address for each page. But an address on its own isn't much use - seven words comprise an entry in the ARM's MMU, so our 17920 pages would require 501760 bytes in order to fully index the memory.
You just can't store that lot in the MMU. So you'll store a snippet, say 16K worth?, and keep the rest in RAM.



The TLB
The Translation Lookaside Buffer is a way to make paging even more responsive. Typically, a program will make heavy use of a few pages and barely touch the rest. Even if you plan to byte read the entire memory map, you will be making four thousand hits in one page before going to the next.
A solution to this is to fit a little bit in the MMU that can map virtual addresses to their physical counterparts without traversing the page table. This is the TLB. It lives within the MMU and contains details of a small number of pages (usually between four and sixty four - the ARM610 MMU TLB has thirty two entries).
Now, when we have a page lookup, we first pass our virtual address to the TLB which will check all of the addresses stored, and the protection level. If a match is found, the TLB will spit out the physical address and the page table isn't touched.
If a miss is encountered, the TLB will evict one of its entries and load in the page information looked up from the page table. The TLB then knows about the newly requested page, so it can quickly satisfy the next memory access - chances are it will fall within the page just requested.
So far we have figured on the hardware doing all of this, as in the ARM processor. Some RISC processors (such as the Alpha and the MIPS) will pass the TLB miss problem to the operating system. This may allow the OS to use some intelligence to pre-load certain pages into the TLB.




Page size
Users of a RISC OS 3.5 system running on an ARM610 with two or more large (say, 20Mb) applications running will know the cost of a 4K page - because it's bloody slow. To be fair, this isn't the fault of the hardware, but more the WIMP doing stuff the kernel should do (and does do, from RISC OS 3.7 onwards), and doing it more slowly!
As with harddisc LFAUs, what you need is a sensible trade-off between page granularity and page size. You could reduce the wastage in memory by making pages small, say 256 bytes, but then you would need a lot of memory to store the page table - and the bigger the page table, the slower it is to scan through. Or you could have 64K pages, which make the page table small, but can waste huge amounts of memory.
To consider, a 32K program would require eight 4K pages, or sixty four 512 byte pages. If your system remaps memory when shuffling pages around, it is quicker to move a smaller number of large pages than a larger number of small pages.

The MEMC in older RISC OS machines had a fixed page table. So the size of page depended upon how much memory was utilised.

MEMORY    PAGE SIZE
0.5Mb     8K
1Mb       8K
2Mb       16K
4Mb       32K


3Mb wasn't a valid option, and 4Mb is the limit. You can increase this by fitting a slave MEMC, in which case you are looking at 2 lots of 4Mb (invisible to the OS/user).
In a RiscPC, the MMU accesses a number of 4K pages. The limits are due, I suspect, to the system bus or memory system, not the MMU itself.
Most commercial systems use page sizes in the order 512 bytes to 64K.
The later ARM processors (ARM6 onwards) and the Intel Pentium both use page sizes of 4K.




Page replacement algorithms
When a page fault occurs, the operating system has to pick a page to dump, to allow the required page to be loaded. There are several ways this may be achieved. None of them is perfect; each is a compromise between efficiency and simplicity.
Not Recently Used
This requires two bits to be reserved in the page table: a read/written bit and a page referenced bit. Upon each access, the paging hardware (and it must be done in hardware, for speed) will set the bits as necessary. Then, at a fixed interval (either when idling, or upon a clock interrupt), the operating system will clear these bits. This allows you to track the recent page accesses, so when flushing out a page you can spot those that have not recently been read/written or referenced. NRU removes one of those pages at random. While it is not the best way of choosing which pages to remove, it is simple and gives reasonably good results.

First-In First-Out
It is hoped you are familiar with the concept of FIFO, from buffering and the like. If you are not, consider the lame analogy of the hose pipe in which the first water in will be the first water to come out the other end. It is rarely used, I'll leave the whys and where-fores as an exercise for the bemused reader. :-)

Second Chance
A simple modification to the FIFO arrangement is to look at the access bit, and if it is zero then we know the page is not in current use and can be thrown. If the bit is set, then the page is shifted to the end of the page list as if it was a new access, and the page search continues.
What we are doing here is looking for a page unused since the last period (clock tick). If by some miracle ALL the pages are current and active, then Second Chance simply reverts to FIFO.

Clock
Although Second Chance is good, all that page shuffling is inefficient, so the pages are instead referenced in a circular list (ie, a clock). If the page being examined is in use, we move on and look at the next page. With no concept of the start and end of the list, we just keep going until we come to a usable page.

Least Recently Used
LRU is possible, but it isn't cheap. You maintain a list of all the pages, sorted with the most recently used at the front of the list and the least recently used at the back. When you need to evict a page, you take the one at the back. Because of speed, this is only really possible in hardware, as the list should be updated on every memory access.

Not Frequently Used
In an attempt to simulate LRU in software, we can maintain something vaguely similar: the OS scans the available pages on each clock tick and increments a counter (held in memory, one for each page) depending on the read/written bit.
Unfortunately, it doesn't forget. So code heavily used and then no longer necessary (such as a rendering core) will keep a high count for quite a while, whereas code that is not called often but should be all the more responsive, such as redraw code, will have a lower count and thus stands a chance of being kicked out - even though the higher-rated renderer is no longer needed.
But this can be fixed, and the fix emulates LRU quite well. It is called aging. Just before the count is incremented, it is shifted one bit to the right, so after a number of shifts the count will fall to zero unless the bit keeps being added. Here you might be wondering how adding a bit can work if you've just shifted a bit off the end. The answer is simple: the bit is added in the leftmost, ie most significant, position.
The make this clearer...

Once upon a time: 0 0 1 0 1 1
Clock tick : 0 0 0 1 0 1
Clock tick : 0 0 0 0 1 0
Memory accessed : 1 0 0 0 0 1
Clock tick : 0 1 0 0 0 0
Memory accessed : 1 0 1 0 0 0
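In C, the per-tick update amounts to a shift and an OR. A sketch (the 8-bit counter width and the names are illustrative, not from any real OS):

#include <stdint.h>

/* Aging: on every clock tick, shift each page's counter right one bit
 * and add the referenced bit at the most significant end, then clear
 * the referenced bit ready for the next interval.                    */
void age_counters(uint8_t *counter, uint8_t *referenced, int npages)
{
    for (int i = 0; i < npages; i++) {
        counter[i] >>= 1;                /* forget a little        */
        if (referenced[i])
            counter[i] |= 0x80;          /* remember this tick     */
        referenced[i] = 0;               /* reset for next tick    */
    }
}
/* The page with the smallest counter is the best candidate for
 * eviction, which is roughly what LRU would have chosen.             */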




Multitasking
There is no such thing as true multitasking (despite what they may claim in the advocacy newsgroups). To multitask properly, you need a processor per process, with all the relevant bits so processes are not kept waiting. Effectively, a separate computer for each task.
However, it is possible to provide the illusion of running several things at once. In the old days, things happened in the background under interrupt control. Keyboards were scanned, clocks were updated. As computers became more powerful, more stuff happened in the background. Hugo Fiennes wrote a soundtracker player that runs on interrupts, so works in the background. You set it going, it carries on independent of your code.

So people began to think about applying this to applications. After all, most of an application's time is spent waiting for user input. In fact, the application may easily do sweet sod all for almost 100% of the time - measured by an event counter in Emily's polling loop, I type ~1 character a second, while the RiscPC polls a few hundred times a second. That was measured in a multitasking application, using polling speed as a yardstick; imagine the figures in a single-tasking program. So the idea arose: we can load several programs into memory, provide them with some standard facilities and messaging systems, and then let each run for a predefined duration. When the duration is up, we pass control to the next program. When that has used its time, we go to the next program, and so on.
As a brief aside, I wish to point out Schrödinger's cat. A rather cute little moggy, but an extremely important one. It is physically impossible to measure system polling speed in software, and pretty difficult to measure it in hardware. You see, the very act of performing your measurement will affect the results. And you cannot easily 'account' for the time taken to make your measurements because measuring yourself is subject to the same artefacts as when measuring other things. You can only say 'to hell with it', and have your program report your polling rate as being 379 polls/sec, knowing that your measuring code may be eating around 20% of the available time, and use the figures in a relative form rather than trying to state "My computer achieves 379 polls every second". While there is no untruth in that, your computer might do 450 if you weren't so busy watching! You simply can't be JAFO.
...and you need to go to school/college and get bored rigid to find out what relevance any of this has to your cat. Mine is sitting on my monitor, asleep, blissfully unaware of all these heavy scientific concepts. She's probably got the right idea...




Co-operative multitasking
One such way of multitasking is relatively clean and simple. The application, once control has passed to it, has full control for as long as it needs. When it has finished, control is explicitly passed back to the operating system.
This is the multitasking scheme used in RISC OS.
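As a toy illustration (plain C, nothing to do with the RISC OS Wimp; the task names are invented), a co-operative scheduler is just a loop that calls each task in turn and relies on every task returning promptly:

#include <stdio.h>

/* Each 'task' does a small amount of work and then returns - the
 * explicit hand-back of control. If a task never returns, nothing
 * else in the system gets to run.                                    */
static void task_a(void) { puts("task A did a bit of work"); }
static void task_b(void) { puts("task B did a bit of work"); }

int main(void)
{
    void (*tasks[])(void) = { task_a, task_b };
    const int ntasks = 2;

    for (int round = 0; round < 3; round++)   /* a few 'rounds'       */
        for (int i = 0; i < ntasks; i++)
            tasks[i]();                       /* runs until it returns */
    return 0;
}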



Pre-emptive multitasking
Seen as the cure to all the world's ills by many advocates who have seen Linux (not Windows!), this works differently. Your application is given a timeslice. You can process whatever you want in your timeslice. When your timeslice is up, control is wrested away and given to another process. You have no say in the matter, peon.




I don't wish to get into an advocacy war here. My personal preference is co-operative, however I don't feel that either is the answer. Rather, a hybrid using both technologies could make for a clean system. The major drawback of CMT is that if an application dies or goes into a never-ending loop, control never comes back; the application then needs to be forcibly killed off.
Niall Douglas wrote a pre-emption system for RISC OS applications. Surprisingly, you didn't really notice anything much until an application entered some heavy processing (say, ChangeFSI), at which point life carried right on as normal while the task that would otherwise have stalled the machine chugged away in the background.



--------------------------------------------------------------------------------


Processor setup via co-processor 15
and about co-processors




Introduction
ARM processors after (and including) the ARM 3 offer various ID and internal configuration facilities by providing an internal co-processor 15, which you can read from and write to.
The setup is controlled by co-processor 15 registers, accessed with MRC and MCR in non-user mode.

These registers are particular to the processor specified.






ARM 3
Register 0 - Processor identification (read only)

Bits 0 - 7 Revision of processor
Bits 8 - 15 Should be '3', identifying processor as an ARM3
Bits 16 - 23 Manufacturer code (&56 = VLSI Technology Inc.)
Bits 24 - 31 Designer code (&41 = ARM Ltd)
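Given the value read from register 0 (for example with the MRC code shown later in this document), the fields can be separated with a little shifting and masking. A sketch in C using the ARM3 layout above; the function name and the sample value are mine, for illustration only:

#include <stdio.h>
#include <stdint.h>

/* Decode an ARM3-style CP15 ID register value into its fields.
 * The value itself would come from 'MRC CP15, 0, Rd, C0, C0'.        */
void decode_arm3_id(uint32_t id)
{
    unsigned revision     =  id        & 0xFF;   /* bits 0-7   */
    unsigned part         = (id >> 8)  & 0xFF;   /* bits 8-15  */
    unsigned manufacturer = (id >> 16) & 0xFF;   /* bits 16-23 */
    unsigned designer     = (id >> 24) & 0xFF;   /* bits 24-31 */

    printf("designer &%02X, manufacturer &%02X, part &%02X, revision &%02X\n",
           designer, manufacturer, part, revision);
}

int main(void)
{
    decode_arm3_id(0x41560300);   /* a plausible ARM3 ID, for illustration */
    return 0;
}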




Register 1 - Cache flush (write only)
This register is write-sensitive: writing anything to register 1 will cause the cache to be flushed.


Register 2 - Miscellaneous control

Bit 0 - Turns the cache on (1) or off (0)
Bit 1 - Determines if user mode and non-user mode use the same address
mapping. 1 if they do, or 0. Should be 1 for use with MEMC.
Bit 2 - 0 for normal operation, 1 for special monitor mode (processor
runs at memory speed and address/data always put on external
pins even if data fetched from cache - for logic analyser
to trace the program properly).

Other bits reserved.




Register 3 - Which areas are cachable
Controls which areas of memory are cachable, in 2Mb chunks.
Bit 0 - 1 if virtual addresses &0000000-&01FFFFF are cachable, 0 if not
Bit 1 - 1 if virtual addresses &0200000-&03FFFFF are cachable, 0 if not
...
Bit 31 - 1 if virtual addresses &3E00000-&3FFFFFF are cachable, 0 if not




Register 4 - Which areas are updateable
Controls which areas of memory are updateable, in 2Mb chunks. Writes to non-updateable memory go to the real memory, not the cache. This is suitable for things like ROMs, since you don't want the cached data to be altered by attempted writes.
Bit 0 - 1 if virtual addresses &0000000-&01FFFFF are updateable, 0 if not
Bit 1 - 1 if virtual addresses &0200000-&03FFFFF are updateable, 0 if not
...
Bit 31 - 1 if virtual addresses &3E00000-&3FFFFFF are updateable, 0 if not




Register 5 - Which areas are disruptive
Controls which areas of memory are disruptive, in 2Mb chunks. Writes to disruptive areas of memory cause the cache to be flushed. For example, on a MEMC system physical memory also appears at &2000000-&2FFFFFF; the cache holds virtually addressed data, so if a location has been cached and is then written via its physical address, a read through the virtual address would return the old contents unless the cache is flushed.
Bit 0 - 1 if virtual addresses &0000000-&01FFFFF are disruptive, 0 if not
Bit 1 - 1 if virtual addresses &0200000-&03FFFFF are disruptive, 0 if not
...
Bit 31 - 1 if virtual addresses &3E00000-&3FFFFFF are disruptive, 0 if not

Register 2 is set to zero after power-up, and registers 3-5 are undefined. Registers 3-5 should be set up correctly before the cache is switched on. You should always check the processor identity before setting up the registers, unless you are completely certain your code will only ever be executed on an ARM3 processor.





ARM 610
Register 0 - Processor identification (read only)
The value returned for an ARM610 processor should be &4156061x.

Bits 0 - 7 Revision of processor (&1x)
Bits 8 - 15 Processor identity
Bits 16 - 23 Manufacturer code (&56 = VLSI Technology Inc.)
Bits 24 - 31 Designer code (&41 = ARM Ltd)




Register 1 - Control (write only)
All values set to 0 at power-up.

Bit 0 - On-chip MMU turned off (0) or on (1)
Bit 1 - Address alignment fault disabled (0) or enabled (1)
Bit 2 - Instruction/data cache turned off (0) or on (1)
Bit 3 - Write buffer turned off (0) or on (1)
Bit 4 - 26 bit program space if 0, 32 bit program space if 1
Bit 5 - 26 bit data space if 0, 32 bit data space if 1
Bit 6 - Early abort mode if 0, late abort mode if 1
Bit 7 - Little-endian operation if 0, big-endian if 1
Bit 8 - System bit - controls the ARM610 permission system
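Since the register is write-only, a complete value has to be composed before writing it. A hedged sketch in C of building such a value (the macro names are mine, not from any ARM header, and whether these particular settings are sensible depends entirely on the system):

#include <stdio.h>
#include <stdint.h>

/* Bit positions taken from the table above (ARM610 control register). */
#define CTL_MMU_ON       (1u << 0)
#define CTL_ALIGN_FAULT  (1u << 1)
#define CTL_CACHE_ON     (1u << 2)
#define CTL_WRITE_BUFFER (1u << 3)
#define CTL_PROG32       (1u << 4)
#define CTL_DATA32       (1u << 5)
#define CTL_LATE_ABORT   (1u << 6)
#define CTL_BIG_ENDIAN   (1u << 7)
#define CTL_SYSTEM       (1u << 8)

int main(void)
{
    /* Example only: MMU, cache and write buffer on, 32 bit program and
     * data spaces, little-endian. The value would be placed in an ARM
     * register and written with MCR CP15, 0, Rd, C1, C0.              */
    uint32_t control = CTL_MMU_ON | CTL_CACHE_ON | CTL_WRITE_BUFFER |
                       CTL_PROG32 | CTL_DATA32;

    printf("control register value = &%03lX\n", (unsigned long)control);
    return 0;
}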




Register 2 - Translation Table Base (write only)
Bits 14-31 hold the base of the currently active Level One page table.


Register 3 - Domain Access Control (write only)
This register holds the current access control for domains 0 to 15. Each domain has two bits (domain 0 bits 0,1 ... domain 15 bits 30,31) which may be set as follows:
00 No Access - Domain fault generated if tried to access
01 Client - Accesses are checked against permission bits in
section/page descriptor
10 Reserved - Currently behaves like no access mode
11 Manager - Accesses are NOT checked, permission faults cannot
be generated
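Because each domain occupies two bits, a value for this register can be packed with a little shifting. A sketch in C; the enum and function names are invented for illustration:

#include <stdio.h>
#include <stdint.h>

enum domain_access { NO_ACCESS = 0, CLIENT = 1, /* 2 is reserved */ MANAGER = 3 };

/* Set the two-bit field for one domain (0-15) in a Domain Access
 * Control value.                                                     */
static uint32_t set_domain(uint32_t dacr, int domain, enum domain_access a)
{
    dacr &= ~(3u << (domain * 2));           /* clear the old field */
    dacr |=  ((uint32_t)a << (domain * 2));  /* insert the new one  */
    return dacr;
}

int main(void)
{
    uint32_t dacr = 0;                       /* everything 'No Access' */
    dacr = set_domain(dacr, 0, CLIENT);      /* domain 0: checked      */
    dacr = set_domain(dacr, 1, MANAGER);     /* domain 1: unchecked    */
    printf("DACR = &%08lX\n", (unsigned long)dacr);  /* written via MCR ... C3 */
    return 0;
}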




Register 4 - Reserved - do not attempt to access


Register 5 - Page fault status / TLB flush
When reading, this holds the status of the last data fault (not updated for pre-fetch fault). Only the bottom byte is of significance.
Bits 0 - 3 Status
Bits 4 - 7 Domain
Bits 8 - 11 Set to zero
Bits 12 - 31 Whatever was the last value on the internal data bus



When writing to this register, any value written will cause the Translation Look-aside Buffer to be flushed.


Register 6 - Data fault address / TLB purge
When reading this register, you can determine the virtual address of the last page fault.

When writing this register, the value given (in bits 14-31) is treated as an address. The TLB will be searched for a corresponding address and if it is found, it is marked as invalid. This is to allow the page table in main memory to be updated and the now-invalid entries in the on-chip TLB to be purged without assuming the penalty of flushing the entire TLB.


Register 7 - IDC flush (write only)
Any data written to this location will cause the IDC (Instruction/Data cache) to be flushed.


Registers 8 to 15 - Reserved
Accessing these registers will cause the undefined instruction trap to be taken.





ARM 710
This is similar to the ARM610.
Register 0 - Processor identification (read only)
The value returned for an ARM710 processor should be &4104710x.

Bits 0 - 3 Revision of processor
Bits 4 - 15 Processor identity - &710
Bits 16 - 23 Manufacturer code
Bits 24 - 31 Designer code (&41 = ARM Ltd)




Register 1 - Control (write only)
All values set to 0 at power-up.

Bit 0 - On-chip MMU turned off (0) or on (1)
Bit 1 - Address alignment fault disabled (0) or enabled (1)
Bit 2 - Instruction/data cache turned off (0) or on (1)
Bit 3 - Write buffer turned off (0) or on (1)
Bit 4 - 26 bit program space if 0, 32 bit program space if 1
Bit 5 - 26 bit data space if 0, 32 bit data space if 1
Bit 6 - Early abort mode if 0, late abort mode if 1
Bit 7 - Little-endian operation if 0, big-endian if 1
Bit 8 - System bit - controls the ARM710 permission system
Bit 9 - ROM bit - controls the ARM710 permission system




Register 2 - Translation Table Base (write only)
Bits 14-31 hold the base of the currently active Level One page table.


Register 3 - Domain Access Control (write only)
This register holds the current access control for domains 0 to 15. Each domain has two bits (domain 0 bits 0,1 ... domain 15 bits 30,31) which may be set as follows:
00 No Access - Domain fault generated if tried to access
01 Client - Accesses are checked against permission bits in
section/page descriptor
10 Reserved - Currently behaves like no access mode
11 Manager - Accesses are NOT checked, permission faults cannot
be generated




Register 4 - Reserved - do not attempt to access


Register 5 - Page fault status / TLB flush
When reading, this holds the status of the last data fault (not updated for pre-fetch fault). Only the bottom byte is of significance.
Bits 0 - 3 Status
Bits 4 - 7 Domain
Bits 8 - 11 Set to zero
Bits 12 - 31 Whatever was the last value on the internal data bus



When writing to this register, any value written will cause the Translation Look-aside Buffer to be flushed.


Register 6 - Data fault address / TLB purge
When reading this register, you can determine the virtual address of the last page fault.

When writing this register, the value given (in bits 14-31) is treated as an address. The TLB will be searched for a corresponding address and if it is found, it is marked as invalid. This is to allow the page table in main memory to be updated and the now-invalid entries in the on-chip TLB to be purged without assuming the penalty of flushing the entire TLB.


Register 7 - IDC flush (write only)
Any data written to this location will cause the IDC (Instruction/Data cache) to be flushed.


Registers 8 to 15 - Reserved
Accessing these registers will cause the undefined instruction trap to be taken.





ARM 7500
The registers are exactly the same as the ARM710, except the processor ID (register 0) will be different. The datasheet did not specify what should be expected.





ARM 7500FE
The registers are exactly the same as the ARM710, except the processor ID (register 0) will be different. The datasheet did not specify what should be expected, however interrogation of the Bush set-top box reveals &41077100.





StrongARM SA110
Register 0 - Processor identification (read only)
The value returned for an SA110 processor should be &4401A10x.

Bits 0 - 3 Processor revision number



Register 1 - Control (read/write)
All values set to 0 at power-up.

Bit 0 - On-chip MMU turned off (0) or on (1)
Bit 1 - Address alignment fault disabled (0) or enabled (1)
Bit 2 - Data cache turned off (0) or on (1)
Bit 3 - Write buffer turned off (0) or on (1)
Bit 7 - Little-endian operation if 0, big-endian if 1
Bit 8 - System bit - controls the MMU permission system
Bit 9 - ROM bit - controls the MMU permission system
Bit 12 - Instruction cache turned off (0) or on (1)




Register 2 - Translation Table Base (read/write)
Bits 14-31 hold the base of the currently active Level One page table.


Register 3 - Domain Access Control (read/write)
This register holds the current access control for domains 0 to 15.
The document I have contains no further details, though I would assume it would be similar to the ARM610/710/etc usage.


Register 4 - Reserved - do not attempt to access


Register 5 - Fault status (read/write)
When reading, this holds the status of the last data fault (not updated for pre-fetch fault). Only the bottom byte is of significance.
Bits 0 - 3 Status
Bits 4 - 7 Domain
Bit 8 Zero
Bits 9 - 31 Undefined on read, ignored on write




Register 6 - Fault address (read/write)
When reading this register, you can determine the virtual address of the last page fault.


Register 7 - Cache control (write only)
Any data written to this register will cause the selected cache operation to occur.
The OPC_2 and CRm co-processor fields select which cache operation should occur:

Function OPC_2 CRm Data

Flush I + D 00 %0111 -
Flush I 00 %0101 -
Flush D 00 %0110 -
Flush D single 01 %0110 Virtual address
Clean D entry 01 %1010 Virtual address
Drain write buf. %0100 %1010 -




Register 8 - TLB operations (write only)
Any data written to this register will cause the selected TLB operation to occur.
The OPC_2 and CRm co-processor fields select which TLB operation should occur:

Function OPC_2 CRm Data

Flush I + D 00 %0111 -
Flush I 00 %0101 -
Flush D 00 %0110 -
Flush D single 01 %0110 Virtual address




Registers 9 to 14 - Reserved
Accessing these registers will cause the undefined instruction trap to be taken.


Register 15 - Test, Clock, and Idle (write only)

The OPC_2 and CRm co-processor fields select the following...

Function                                  OPC_2   CRm

Enable odd word loading of Icache LFSR    01      01
Enable even word loading of Icache LFSR   01      10
Clear Icache LFSR                         01      %0100
Move LFSR to R14 (abort mode)             01      %1000
Enable clock switching                    10      01
Disable clock switching                   10      10
Disable nMCLK output                      10      %0100
Wait for interrupt                        10      %1000






ARM9...XScale
Unfortunately I do not have details of these registers.
Try http://www.arm.com/.





How to read these registers
The code I knocked up for the Bush box processor ID was:
10 DIM code% 32
20 P% = code%
30 [ OPT 3
40 SWI "OS_EnterOS"
50 MRC CP15, 0, R0, C0, C0
60 TSTP PC, #&F0000000
70 MOV R0, R0
80 MOV PC, R14
90 ]
100 PRINT ~USR(code%)
When run, this would print:
>RUN
00008FAC OPT 3
00008FAC EF000016 SWI "OS_EnterOS"
00008FB0 EE100F10 MRC CP15, 0, R0, C0, C0
00008FB4 E31FF20F TSTP PC, #&F0000000
00008FB8 E1A00000 MOV R0, R0
00008FBC E1A0F00E MOV PC, R14
41077100
>
Note that this code must run in a privileged mode.





Co-processors
There are between zero and three possible co-processors. Most desktop ARM systems do not have logic for external co-processors, so we may either use that which is built into the ARM itself, or an emulated co-processor.
CP15 is reserved on the ARM 3 and later processors for internal configuration, as described in this document.
CP0 and CP1 are used by the floating point system. It may either be an external floating point chip (as used with the ARM 3), hardware built into the processor (as in the ARM 7500FE), or a totally software-based emulation (as with the FPEmulator that we all know).
Here is a short exercise for you:

10 DIM code% 16
20 P% = code%
30 [ OPT 3
40 CDP CP1, 0, C0, C1, C2, 0
50 ADFS F0, F1, F2
60 MOV PC, R14
70 ]
>RUN
00008F78 OPT 3
00008F78 EE010102 CDP CP1, 0, C0, C1, C2
00008F7C EE010102 ADFS F0, F1, F2
00008F80 E1A0F00E MOV PC, R14
>
What do you notice? :-)


When the ARM executes a co-processor instruction, or an undefined instruction, it will offer it to any co-processors which may be presently attached. If hardware is available to process the given instruction, then it is expected to do so. If it is busy at the time the instruction is offered, the ARM will wait for it.
If there is no co-processor capable of executing the instruction, the ARM will take its undefined instruction trap, in which case the following will happen:

The PSR and PC are both saved (the method differs for 26 bit and 32 bit ARMs)
SVC mode (26 bit) / UND mode (32 bit) is entered, and the I bit of the PSR is set
The instruction at address &00000004 is executed
This trap may be used to add instructions to the instruction set by emulation, or to implement a software emulation of hardware that isn't fitted. The Floating Point Emulator works by doing this.
To return, restore the saved PC and PSR to the current PC and PSR (how depends on whether it is a 26 bit or 32 bit system) - for example MOVS PC, R14 on 26 bit systems. Execution then picks up with the instruction following the one which caused the trap.

All of the co-processor instructions can be executed conditionally. Please note that the conditionals relate to the status of the ARM processor, and not the status of any of the co-processors. This is because the ARM always tries the instruction first, offers it around, and maybe takes the undefined instruction trap, so the conditions are ARM related.
To make this clearer:

10 DIM code% 32
20 P% = code%
30 [ OPT 3
40 FLTS F0, R0
50 FLTS F1, R1
60 FMLS F2, F0, F1
70 FIX R0, F2
80 MOVS PC, R14
90 ]
100 INPUT "First number : "A%
110 INPUT "Second number: "B%
120 PRINT USR(code%)

This probably won't assemble without an enhanced BASIC assembler.
Anyway, you might think the ARM will hand over to the floating point co-processor to do the four FP instructions, then hand back afterwards.
If you did, you would be incorrect!

What actually is executed is:

MCR CP1, 0, R0, C0, C0
MCR CP1, 0, R1, C1, C0
CDP CP1, 9, C2, C0, C1
MRC CP1, 0, R0, C0, C2
It is worth pointing out that objasm specifies co-processor registers using the CR notation (ie, CR0 - CR15), first defined with the CN directive. It does not appear that default co-processor instructions are defined in Nick Roberts' ASM, though I've only looked at the "defined symbols" section of the instructions...
Darren Salt's ExtBASICasm provides the register names C0 - C15 to refer to the co-processor registers. So if any of these examples fail when you try to assemble them, please check how your assembler expects these instructions to be written.






MRC
The instruction MRC transfers a co-processor register to an ARM register. It takes the form:
MRC <co-processor>, <op1>, <ARM register>, <CRn>, <CRm>, <op2>
The co-processor is denoted in most assemblers by CPx.
The ARM register is written with the contents of co-processor register <CRn>, using operation <op1>. This may, possibly, be further modified by <CRm> and <op2>. For an idea of the sorts of times when this might be necessary, consider instructions of the form LDR Ra, [Rb], #x.
The final <op2> may be omitted, as it is in the example, but the other parts of the MRC instruction must be supplied.



MCR
The instruction MCR transfers an ARM register to a co-processor register. It takes the form:
MCR <co-processor>, <op1>, <ARM register>, <CRn>, <CRm>, <op2>
The co-processor is free to interpret the fields as it desires, but the standard interpretation is that the contents of the ARM register are written to the co-processor register using the operation code given, which may be further modified by the second co-processor register and/or the second operation code.



LDC and STC
The instruction LDC loads data from memory into the co-processor register, while STC saves data from a co-processor register to memory.
The ARM should supply the address, the co-processor accepts the data and controls how much is transferred.
LDC <co-processor>, <CRd>, <address>

LDCL <co-processor>, <CRd>, <address>

STC <co-processor>, <CRd>, <address>

STCL <co-processor>, <CRd>, <address>

If the 'L' flag is specified, a long transfer is performed; otherwise a short transfer is performed. The 'L' flag follows the condition code, as in LDCEQL.
The address is an expression which results in an address being generated, so examples of which are:
[Rx]
[Rx, #x]!
[Rx], #x
These are like those used for the LDR instruction, except that the offsets are only eight bits wide and specify word offsets (LDR offsets are 12 bits wide and specify byte offsets).
What happens is that the 8 bit unsigned offset is shifted left two bits and added to or subtracted from the base register; this may be done before or after the base is used as the transfer address. The new base value can be written back, or left unmodified.
The next difference is that post-indexed addressing requires explicit setting of the W bit of the instruction (unlike LDR/STR, which always write back when post-indexed). You set the 'W' bit with the '!' flag, like STC CP0, CR1, [R2, #16]!.
The base register is used for the first transfer. If there are any further transfers, the base will be incremented by one word for each of those additional transfers.
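To make the offset arithmetic concrete, here is a sketch in C of how the effective address could be formed (purely illustrative; the parameter names are mine):

#include <stdint.h>

/* Compute the transfer address and updated base for an LDC/STC-style
 * 8-bit word offset, as described above.                              */
uint32_t coproc_address(uint32_t *base, uint8_t offset8,
                        int up, int pre_indexed, int writeback)
{
    uint32_t delta   = (uint32_t)offset8 << 2;          /* words -> bytes    */
    uint32_t applied = up ? (*base + delta) : (*base - delta);
    uint32_t addr    = pre_indexed ? applied : *base;   /* offset before/after */

    if (writeback)            /* the '!' flag; post-indexing needs it set */
        *base = applied;
    return addr;
}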



CDP
The instruction CDP instructs the co-processor to do some processing. It takes the form:
CDP <co-processor>, <op1>, <CRd>, <CRn>, <CRm>, <op2>
This tells the co-processor to do something. The ARM will not wait for it to finish, nor is any sort of status sent back to the ARM. It is possible for a co-processor to maintain a queue of instructions, allowing it and the ARM to process in parallel.
A variant of this may be obtained with the floating point hardware; while it does not (to my knowledge) support a queue of instructions, the ARM will wait for the FPU to finish an operation before providing the next. With careful coding, it would therefore be possible to get the ARM to do some sort of processing (a few instructions) in between sending an instruction to the FPU and reading its result back.
So instead of:
FLTE F0, R0
FLTE F1, R1
MUFE F2, F0, F1
FIX R0, F2
MOV R1, #0
you could save a small amount of time with:
FLTE F0, R0
FLTE F1, R1
MUFE F2, F0, F1
MOV R1, #0
FIX R0, F2
as the FPU could be finishing the MUF while you MOV. The hardware FPU (as in the 7500FE) runs asynchronously - you can switch to synchronous operation by setting a bit in the FPSR. The software emulation always runs synchronously, and as it uses the ARM in order to emulate the FP instructions, there is no possible advantage to be gained.
Obviously the above example is somewhat contrived. However it is only an example. Real life code, such as an MP3 decoder, could well benefit from careful arrangement of code.
There are no rules for the register types and/or the operation codes. These depend upon the co-processor.
