[Continuation of restoration work on a DEC GT40 graphic display terminal; part one here.]

At this point, Scott had taken over the restoration work as I had had to leave town for work. We consulted a few times via IMs and video calls over the next couple weeks; the following is a narrative of Scott's continued work on the project as I understood it remotely.

The next thing that needed doing was to replace the failed microcode PROM described at the end of the previous article. I did some work to manually transcribe the PROM contents from the binary microcode listings included in the engineering drawings (4 bits x 256 microcode locations). Scott double-checked my work here and found and fixed at least three transcription errors (always good to have a double check on tedious tasks like this, and I seem to be developing a worsening dyslexia with age -- thanks, Scott!)

Scott tracked down and burned a replacement PROM and replaced the failing one on the board, and we were back again to the previous high water mark (able to run toggle-in programs and the ROM bootstrap terminal emulator, with the same linefeed and binary load failures I had seen previously).

Scott played around with the binary loader for a bit, but it seemed to be suffering a pretty frustrating blend of several different issues. Attention was turned back to the bootstrap ROM terminal emulator LF handling problem, which was consistent and repeatable. Scott began single stepping the code by instruction, using the listings in the GT40/GT42 User's Guide, and soon made two discoveries:

  • The unit has the GT42 version of the bootstrap ROM, and not the GT40 version (this can be seen because the bootstrap terminal emulator correctly handles TAB characters).

  • Upon receiving a LF char, the bootstrap code got to a loop which was scanning the input buffer looking for LFs, but failed to find any and looped indefinitely.

The malfunctioning ROM code scanning for LFs can be seen at location 166310, in the listing on PDF page 81 of the GT40/GT42 User's Guide, and is as follows:

166310  122300              LFLOOP: CMPB    (SCAN)+,CHAR        ;AND LOOK FOR A LINEFEED
166312  001406                      BEQ     LFOUND              ;GOT IT, SEARCH HAS ENDED
166314  020327  007000              CMP     SCAN,#BLIMIT        ;ARE WE AT END OF BUFFER?
166320  103773                      BLO     LFLOOP              ;NOPE, KEEP ON LOOKING.
166322  012703  001000              MOV     #BSTART,SCAN        ;IF AT TOP, RESET TO BOTTOM OF BUFFER
166326  000770                      BR      LFLOOP              ;AND KEEP ON LOOKING.
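
As a rough aid to reading the listing, the scan loop can be modeled in a few lines of Python. This is only a sketch: BSTART and BLIMIT are taken from the assembled words above (001000 and 007000), while the dictionary-based memory and the max_steps bound are assumptions added so the model terminates; the real ROM loop has no such bound, which is exactly why it spins forever when no match is seen.

```python
BSTART = 0o1000   # bottom of input buffer (from MOV #BSTART,SCAN)
BLIMIT = 0o7000   # end of input buffer (from CMP SCAN,#BLIMIT)

def scan_for(char, mem, scan, max_steps):
    """Model of LFLOOP: circular byte scan of the buffer for `char`.

    mem is a hypothetical {address: byte} dict; max_steps is not in the
    ROM code and exists only so this model cannot loop indefinitely.
    Returns the SCAN value at LFOUND (one past the match), or None.
    """
    for _ in range(max_steps):
        byte = mem.get(scan, 0)      # CMPB (SCAN)+,CHAR reads the byte...
        scan += 1                    # ...and post-increments SCAN
        if byte == char:             # BEQ LFOUND
            return scan
        if scan >= BLIMIT:           # BLO LFLOOP not taken: wrap around
            scan = BSTART
    return None

LF = 0o012
print(oct(scan_for(LF, {0o1005: LF}, 0o6776, 100)))
```

With a healthy data path the scan wraps from the top of the buffer and finds the linefeed; if the compare never matches, the model runs out of steps where the hardware would simply never leave the loop.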

Scott began microstepping at program address 166310, which is machine code 122300, CMPB (R3)+,R0. The microcode flow traced through is as follows, using state names from the microcode listings in the engineering drawings:

  • H-2: Tracing activity starts with the machine halted and looping in microstate H-2. The KM11 is set to manual clock mode, front panel CONT switch depressed and held, and several manual clocks taken causing microbranch to...

  • CCS-1: Loads B←PC, causing PC to be displayed on console lights.
  • CCS-2: Loops waiting for CONT switch to be released.
  • CCS-3: Turns on RUN light.

  • F-1: Loads BA←PC, and initiates asynchronous bus cycle to fetch instruction.
  • F-2: Loads B←PC+2, causing next instruction address to be displayed on console lights.
  • F-3: Loads PC←B, updating the PC, and suspends processor clock until instruction fetch bus cycle reaches SSYN.
  • F-4: Resumes here when fetched instruction is on bus; latches into B (displaying instruction on console lights) and also into the IR decode register; releases the bus.
  • F-5: First big microcode branch based on instruction type.

  • S2-1: Source addressing mode 2 (register auto-increment). BA←R[S], and initiates asynchronous bus cycle to read source operand from memory.
  • S2-2: B←R[S]+1+BYTE.BAR, which increments the source register by 1 or 2 depending on byte or word instruction.
  • S2-3: R[S]←B (stores back incremented source register), suspends processor clock until source operand fetch bus cycle reaches SSYN.
  • S1-2: Resumes here when source operand is on bus; latches into B (displaying source operand on console lights) and releases the bus, then branches on byte even, byte odd, or word.

So far so good. In the case being traced, we happen to be doing a byte read from an odd address. In this case, the fetched source data must next be shifted right 8 bits; this is done over the course of the next 8 microinstructions, SBO-1 through SBO-8. Here Scott noticed a problem: bit 3 was always set in the B register after any single right shift, even if bit 4 to its left was zero. This points directly at E044 on sheet DPA, a four bit shift register which implements bits 0:3 of the B-register:
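
To see why this one stuck bit is enough to break the linefeed search, here is a small Python model of the eight single right shifts. It is a sketch under a stated assumption: the fault is modeled as bit 3 being forced to 1 after every shift, which matches what was observed but is not taken from the schematics. The sample word is made up for illustration.

```python
def shift_right_8(word, bit3_stuck=False):
    """Model SBO-1..SBO-8: eight single right shifts of the B register.

    bit3_stuck=True models the observed fault (assumption: bit 3 reads
    as 1 after every shift). Returns the resulting low byte.
    """
    b = word & 0xFFFF
    for _ in range(8):
        b >>= 1
        if bit3_stuck:
            b |= 0o10          # force bit 3 high after the shift
    return b & 0xFF

LF = 0o012
word = (LF << 8) | 0x41        # hypothetical word: LF in the odd (high) byte

print(oct(shift_right_8(word)))                   # healthy data path
print(oct(shift_right_8(word, bit3_stuck=True)))  # with the stuck bit
```

In this model the forced 1 is shifted down through bits 2, 1 and 0 on the last four shifts, so the low nibble of the result always ends up as all ones. A linefeed (012) can then never compare equal, consistent with the endless loop seen in the ROM code.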

This part was pulled and replaced, and the ROM terminal emulator could then correctly handle LFs! After a few additional red herrings to do with loose power connectors and occasional accidental bumping of the test switches on the M7013 display control board, Scott was also able to get the lunar lander code to load and run via the ROM bootstrap binary loader, though still with some display problems:

Scott discovered a major clue concerning remaining loader problems: some GT40 binary-loader encoded binaries we had been using which were downloaded off other enthusiast web sites contained erroneous extra linefeed and "!" characters, which confused the loader and/or triggered bad checksums. After stripping these out, the loader was seen to work quite reliably.

With diagnostics now in hand, Scott was able to track down a few remaining hardware issues on the display boards: a bad register with a stuck high bit, and one of the DACs which had been acting flaky and was swapped with one from a spare board. I don't have precise details on these particular fixes, but will expand here later if/when I get more information.

Below, screen shots of some diagnostics, and at long last, Scott lands on the moon and gets his cheeseburger! Drop by and visit Scott at his booth at VCFMW this weekend, see and play the game, and hear tales of the restoration first-hand!


GT40 Terminal

Mon 21 August 2023 by Fritz Mueller

A while ago my friend Scott approached me with an idea to collaborate on restoration of a DEC GT40 graphic display terminal of unknown status, belonging to a third collector friend of ours; the idea was to restore the machine to working order for exhibition at the various summer/fall vintage computer shows. The GT40 ran an early (pre-Atari) graphical version of the lunar lander game which was released in 1973. The 50th anniversary of this code seemed a nice theme for the exhibit.

The GT40 was an integrated product consisting of a PDP-11/05 (KD11-B) CPU, a VT11 vector display processor, a VR14 X-Y monitor, a light pen, keyboard, and a DL11 serial interface:

Scott retrieved the terminal, which had a fairly bad case of screen rot. We agreed that Scott would work on restoration of the monitor while I dug in on the system unit. Scott got to work while I dealt with distractions of ongoing home renovations, a ton of work-related travel, and my first bout of COVID (blaargh!)

The screen rot is caused by a deterioration of a polyvinyl acetate (PVA) layer sandwiched between the front surface of the CRT glass and an outer protective implosion shield. All of this is held together by a retaining ring affixed to the CRT with silicone adhesive. The only fix for this is to disassemble the monitor, separate the sandwich, and clean out and replace the deteriorated PVA layer.

After chatting with some folks who had successfully conducted a similar VR14 restoration at the Rhode Island Computer Museum, Scott obtained some silicone digester to aid in separation of the retaining ring. The terminal was disassembled and then digester was repeatedly injected under the ring with a syringe, allowed to sit, and the resulting softened silicone scraped away over the course of a week.

Scott then worked to conform a lexan sheet to the interior of the implosion shield as a replacement for the PVA layer, as RICM had done. This process, conducted in a home oven, proved to be quite fiddly. But persistence paid off, and the end result looks very nice!

After a precautionary reform of the larger power supply electrolytics, careful reassembly, and a gradual bringup on a variac, the monitor showed proof of life on the bench, hooked up to a signal generator source.

In the meantime, starting to feel better, I began to look at the CPU unit. Power supply electrolytics appeared to be in good shape, and the supply came up on the bench without much difficulty.

The module utilization for this backplane is as follows:

Slot  Modules (across positions A-F)
1     A320 VT40 Display Generator
2     M7013 VT40 Display Control
3     M7014 VT40 Bus Control
4     M7800 DL11 Serial
5     M930 Term. / UNIBUS out | H214 Core Stack (8K x 16)
6     G231 Core Memory Driver
7     G110 Core Memory Control
8     M7261 KD11-B Control
9     M7260 KD11-B Data Paths

On the assumption (later proved wrong) that this was effectively the same as a PDP-11/05 setup, I began debug with just the two CPU cards, an M9301 boot/terminator in position 5A-B, and a grant continuity "knuckle buster" in position 4D. Some problems were immediately apparent from the front console: deposit and examine operations to various memory-mapped CPU registers seemed to work as expected, but when examining the contents of M9301 ROM locations, bit 13 always displayed as zero. The CPU would not enter run mode, nor could it be single stepped.

Docs suggested that the GT40 would accommodate a KM11 debug module in position 1B, so in this went. The machine behaved even more strangely when the KM11 was in, hanging up entirely unless the KM11 was put in manual clock mode, and even then stepping microstates at unexpected times. It took a little probing of the CPU clock circuits to discover why:

Here we see the RC clock at E019. CONJ MAN CLK L is wired to KM11 switch S2, and inhibits the RC clock when pulled low. With the RC clock thus disabled, NOR gate E027 admits manual clocking via CONJ S CLK ON L, connected to KM11 (momentary) switch S4. The output at E027 pin 11 continues downstream from here as the basis of the main processor clock signal.

As it happened, momentary switch S4 was wired on my KM11 replica with opposite sense from that expected. Thus in its resting position CONJ S CLK ON L was asserted (low), which meant the clock output at E027 pin 11 was forced constantly high, regardless of the state of the RC clock. This was verified by leaving S2 "off" and pulling S4 over to its momentary position, whence the CPU clock immediately picked up again.

I had never noticed this switch reversal when using the KM11 with the 11/45, the RK11-C, or the 11/34 -- all of these have different clocking circuits unaffected by the default position of S4. Desoldered and rotated S4 180 degrees, and the problem was fixed.

After having addressed this, I single stepped through a few of the console microcode flows and was able to match the microcode listings to what was displayed on the KM11 and the console lights with some success. An action shot of the KM11:

A few tips for anybody else who might be micro-stepping the KD11-B CPU, while we are here:

  • The MPC address displayed on the KM11 is negated -- dark LEDs are ones, and lit LEDs are zeros. This definitely takes a little getting used to...

  • The MPC address displayed on the KM11 is the address of the next micro-instruction, not the current one. This is also pretty tricky until you get the hang of it. One nice thing about it, though, is that the displayed next address does include the wired-or outputs of micro-branches.

  • Each manual clock toggle is one bus clock, and typically, a micro-instruction will take two bus clocks to execute. An exception is the inner part (single shifts) of the micro-flows for shift and rotate instructions, which only take a single bus clock. Generally, it is useful to go ahead and advance two bus clocks at a time, as it is easy to get confused probing for signals that by design aren't clocked until the second bus clock within the micro-instruction.

  • The console lights are hard-wired to always display the ALU B-leg input. Useful intermediate information is often displayed there intentionally by the microcode flows.
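
Since the negated MPC display takes some getting used to, a tiny helper like the following can be handy when transcribing LED patterns from photos or notes. It is only a sketch: it assumes an 8-bit micro-address (the microcode has 256 locations) entered most significant bit first, and the function name and input format are made up for illustration.

```python
def mpc_from_leds(lit):
    """Convert a KM11 MPC LED pattern to the micro-address it encodes.

    lit: iterable of booleans, most significant bit first (assumed
    ordering); True means the LED is lit. The display is negated, so a
    lit LED is a 0 bit and a dark LED is a 1 bit.
    """
    value = 0
    for led in lit:
        value = (value << 1) | (0 if led else 1)
    return value

# Micro-location 317 (octal) is 11 001 111 in binary, so it would show
# as dark, dark, lit, lit, dark, dark, dark, dark.
pattern = [False, False, True, True, False, False, False, False]
print(oct(mpc_from_leds(pattern)))
```

Remember that the decoded value is the address of the next micro-instruction, not the one currently executing.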

Now it was possible to put the data paths board out on extenders and step the microcode for a console examine of a ROM location with bit 13 set, and see why bit 13 never showed up on the console lights. To understand this properly, we need to see an excerpt of the KD11-B data paths:

Here you see the ALU in the middle, fed by its B-leg and A-leg inputs. B-leg is fed from the B-register, with provisions for shifting, sign-extension, or forcing the constant +1. B-leg is also continuously displayed on the console lights. A-leg contains, significantly, the 16-location scratch-pad memory (SPM). The first eight locations of this hold processor registers R0 through R7; the remaining eight locations serve as temporary registers for use by the microcode. A-leg can also provide miscellaneous constants from a small ROM.

The A-mux, below the ALU, determines whether the main processor data path is fed from the ALU output, or from the UNIBUS data lines.

With this in mind, the relevant microcode source sequence (taken from the listings in the engineering drawings) is as follows:

LOC  NXT  * CONSOLE EXAMINE SWITCH- FIRST TIME IN SEQUENCE (DON'T INC R[17])
          / GET TO CE1-1 FROM H-2 VIA BUT SWITCH
          / GET TO CE1-1 FROM CE2-2 VIA GOTO
317  307  CE1-1 BA,B←R[17]; BUT SWITCH
          / DISPLAY ADDRESS BY PUTTING INTO THE B REGISTER WHILE EXAMINE IS DOWN
          / LOOP ON CE1-1 UNTIL SWITCH IS RELEASED
307  326  CE1-2 DATI; CKOFF
326  302  CE1-3 B←UNIBUS DATA; GOTO H-2

At micro-location 317 (state CE1-1, "console examine 1"), the bus address register and B-register are loaded from SPM location 17, which holds the current console load/examine address. BUT SWITCH ("branch micro-test switch") causes the microcode to loop here as long as the examine switch is depressed. During this time, the fetch address is displayed on the console lights since it has been loaded into the B-register. This was all observed to be functioning normally.

When the examine switch is released, we branch to micro-location 307. Here, a UNIBUS read (DATI) bus cycle is initiated, and the processor clock and microcode execution are suspended until the bus target asynchronously asserts SSYN (indicating valid data on the bus) or alternatively times out. The bus cycle was observed to occur normally, leaving SSYN and the correct data (including a correct bit 13) asserted on the UNIBUS.

Proceeding to micro-location 326, we see that the A-mux should be set up there to admit the data from the UNIBUS to the main processor data path and then the B-register should be latched for display. Here a problem was apparent. Sheet DPD of the GT40 engineering drawings covers bits 15:12 of the data paths; package E015 there is an 8266 2x4 mux which implements that slice of the A-mux. E015 was seen via logic probe to be set up with correct select codes and correct input from the UNIBUS. UNIBUS bit 13 was not being correctly passed on to its corresponding output, however -- a failed part.

The 8266 is out of production and somewhat rare; for the time being a functioning 8266 was "borrowed" from a spare GT40 data paths board that we obtained from our fellow collector. Removed the bad part, socketed, and replaced with the borrowed part, and the bit 13 display problem was fixed!
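
For anyone following along without the drawings, the symptom can be captured in a very small logical model of the A-mux. This is a sketch only: it ignores signal polarities and the actual 8266 pinout, and the dead_bits parameter is simply a made-up way to represent an output that never passes its input. The sample data word is arbitrary.

```python
def amux(select_unibus, alu_out, unibus_data, dead_bits=0):
    """Logical model of the A-mux: pick ALU output or UNIBUS data.

    dead_bits is a hypothetical mask of output bits that fail to pass
    their input (modeled here as reading zero).
    """
    value = unibus_data if select_unibus else alu_out
    return value & 0xFFFF & ~dead_bits

BIT13 = 1 << 13
rom_word = 0o123456            # arbitrary example word with bit 13 set

# CE1-3: B <- UNIBUS DATA, then B is shown on the console lights.
print(oct(amux(True, 0, rom_word)))                   # healthy mux
print(oct(amux(True, 0, rom_word, dead_bits=BIT13)))  # failed slice
```

The second result has bit 13 cleared even though the bus carried it correctly, which is the behavior seen on the console lights before the part was replaced.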

Moving next to the run/step problem, the machine was seen to be hanging up in micro-state F-3, after initiating the DATI bus cycle to fetch an instruction. This led to investigation of some of the bus control logic, as detailed on sheet CONC of the engineering drawings:

The CPU must negotiate for control of the UNIBUS and assert BBSY if successful. Here I could see the DATI request being successfully latched, but BBSY assertion was blocked at E014 by CONC NPR GRANT H, a non-processor request (DMA) bus grant. Sure enough, some more probing indicated that the processor had issued an NPR grant because it was reading an NPR request over the UNIBUS. Where was that coming from with nothing else in the system?

Well, it turns out in the GT40 the near-side bus termination is integrated onto the M7014 GT40 bus control board that must be in slot 3, so you can't really debug without this card in place! (It could be that an additional M930 terminator in 3-A,B would work, as in a stock 11/05, but I haven't checked the backplane wire list in detail to be certain of this.) In any case, slotted in the M7014, and the machine began to behave much more rationally with a properly configured bus...

Went for broke and slotted in the rest of the display interface boards and (why not?) the core memory and DL11 as well. The machine was showing very promising signs of life. The terminal emulator in the bootstrap ROM ran and was able to render received characters on the display! Characters typed on the keyboard were also successfully forwarded out the DL11. A line feed character typed to the terminal emulator seemed to crash it, so that still needed to be looked into. Took the time to toggle in a small test program from the user guide, and this executed correctly rendering a square on the display, indicating most of the logic in the display interface boards was also functioning correctly:

The toggle-in program running above:

000100  012706  000500          START:  MOV     #500,R6         ; SETUP STACK
000104  012737  002000  172000          MOV     #TABLE,@#DPC    ; START VT11 ON TABLE
000112  000001                  DONE:   WAIT                    ; LET NPR HAPPEN
000114  000776                          BR      DONE            ; KEEP WAITING IF INTERRUPTED

002000  117124                  TABLE:  .WORD   POINT+INT4+LPOFF+BLKOFF+LINE0
002002  000500  000500                  .WORD   500, 500
002006  110000                          .WORD   LONGV
002010  040200  000000                  .WORD   200+INTX, 0
002014  040000  000200                  .WORD   0+INTX, 200
002020  060200  000000                  .WORD   200+INTX+MINUS, 0
002024  040000  020200                  .WORD   0+INTX, 200+MINUS
002030  160000  002000                  .WORD   DJMP, TABLE
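
As a cross-check that the display list really does describe a closed square, the long-vector data words can be traced in Python. This is a sketch: the INTX (040000) and MINUS (020000) values are read off the assembled words in the listing, while the 10-bit magnitude mask is my assumption about the long-vector format.

```python
INTX  = 0o040000   # intensify bit, from 040200 = 200+INTX in the listing
MINUS = 0o020000   # sign bit, from 060200 = 200+INTX+MINUS in the listing
MAG   = 0o001777   # magnitude field (assumed to be the low 10 bits)

def delta(word):
    """Signed displacement encoded in one long-vector data word."""
    mag = word & MAG
    return -mag if word & MINUS else mag

def trace(start, vectors):
    """Return the beam positions visited, starting from `start`."""
    x, y = start
    points = [(x, y)]
    for xword, yword in vectors:
        x += delta(xword)
        y += delta(yword)
        points.append((x, y))
    return points

start = (0o500, 0o500)                      # .WORD 500, 500 after POINT
vectors = [(0o040200, 0o000000),            # +200 in X
           (0o040000, 0o000200),            # +200 in Y
           (0o060200, 0o000000),            # -200 in X
           (0o040000, 0o020200)]            # -200 in Y

for x, y in trace(start, vectors):
    print(oct(x), oct(y))
```

The beam ends where it started, after which DJMP sends the display processor back to TABLE to redraw the same figure.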

Tried to get some program uploads going over the built-in binary loader in the bootstrap terminal emulator, but this didn't seem to be quite working, either. Took a short break for dinner, returned to examine this further, ran for a few minutes, then disaster... Something in the CPU let go, and the machine was once again unable to execute code.

Digging in on this new failure a little, when attempting to single step ROM code from the front panel, the PC was seen to increment by +1 instead of the expected +2; this resulted in an immediate bus error that halted the machine. Back in goes the KM11, then, and the same microcode stepping technique was used to begin investigating.

So how does the KD11-B (ostensibly) increment the PC by 2? It turns out this is done by selecting the PC (SPM location 7) onto the ALU A-leg, constant +1 on the ALU B-leg, and introducing the additional +1 at the carry input of the least significant bit slice of the ALU on sheet DPA of the engineering diagrams:

Signal CONF CIN H comes from microcode, wire-or'd with output of operation decode ROMs in the ALU aux control circuitry. In this case, the logic probe revealed that this signal was erroneously low; further investigation revealed that microcode PROM CONF E094 had failed:
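
The arithmetic here is simple enough to state as a one-line model. The sketch below treats the ALU as a plain 16-bit adder with a carry input; the function name and the sample PC value are illustrative only.

```python
def alu_add(a_leg, b_leg, carry_in):
    """16-bit add with carry-in, as used for the PC increment."""
    return (a_leg + b_leg + carry_in) & 0xFFFF

pc = 0o166310                      # illustrative ROM address

print(oct(alu_add(pc, 1, 1)))      # CIN asserted: PC + 2
print(oct(alu_add(pc, 1, 0)))      # CIN stuck low: PC + 1, an odd address
```

With the carry input stuck low, the new PC is odd, so the next instruction fetch is a word access to an odd address, which fits the immediate bus error seen when single stepping.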

Alright, this is an IM5603 (equiv. 82S126N) bipolar PROM, and I don't happen to have that in stock. So now we're stuck until we can source one. At this point, the day job once again intervened, and I needed to prepare to head off to the Rubin Observatory in Chile for a couple of weeks. Scott came by to pick up the work in progress; had time to share a short demonstration of microcode debug techniques, then off to pack and prepare for my trip...

[ to be continued... ]


Okay, so the Apple II Plus I was using to test the VM4209 monitor worked fine for about a half hour, then smoked a RIFA cap in its power supply. These are a very common known failure in various old microcomputer power supplies, and I really should have already caught this on inspection and shotgunned it. Boy, do they generate a lot of smoke and stink when they go...

Electrolytics in here are also all pushing 40 years at this point and Astec opted for only 85C parts for many of these, so I went ahead and ordered a whole replacement kit from Console5. I replaced the front end filters and anything on the back end that was 85C or looked even slightly bulgy, which ended up being most of them:

Put it all back together and back in business, except noticed an additional problem: DOS 3.3 booted but neither recognized nor loaded the "language card" (soft ROM) present in the machine. Since my machine has Applesoft BASIC in system ROM, this meant no Integer BASIC (and no ROM mini-assembler) for me until this was fixed.

For those who might not be familiar with the Apple II language card, it provides several features. An Apple II or II Plus can have at most 48K of RAM on the system board. The overall memory map looks like this:

$0000-$BFFF  RAM (48K)
$C000-$CFFF  IO space (4K)
$D000-$FFFF  ROM (12K)

The first thing the language card provides is a 2K ROM which by default overlays whatever ROM is in the system board socket for addresses $F800 to $FFFF. This ROM holds the 6502 reset vector and the boot monitor; the version provided on the language card has an "autostart" capability that will search for and boot off an attached Disk II floppy. I am not sure why Apple provided this ROM on the language card since the same ROM could be upgraded directly on the fully-socketed system board, and indeed everything after the Apple II Plus came with this autostart ROM pre-installed. Possibly this was done for maximum backward compatibility with Apple's previously-offered "Firmware Card", which also contained this ROM? Third-parties would later offer clones of the language card that did not contain the additional/redundant F8 ROM, and nobody seems to have particularly cared or noticed...

The second thing the language card provides is an additional 16K of RAM. This can be loaded, then write-protected and mapped over the entire ROM space of the machine from $D000 to $FFFF, providing a "soft ROM" capability. Since the available ROM address space is only 12K, the leftover 4K of RAM on the language card is bank switchable at address range $D000 to $DFFF.

Advanced software which doesn't depend on the system ROMs at all can also map the RAM over the system ROM address space and leave it in read/write mode, thus gaining access to a full 64K of RAM (with the caveat of the switched bank at $D000.)

The language card programming interface presents as follows:



My goal is to be able to run the initial (1980) release of Apple DOS 3.3. This version loads up the language card at boot in soft ROM mode with Integer BASIC and an older version of the F8 monitor ROM, which contains a few extra niceties like a built in mini-assembler and primitive step/trace capabilities. A system so loaded can then be conveniently switched back and forth between BASIC and monitor versions with the DOS "INT" and "FP" commands. But detection of my language card seemed to be failing, so it was never loaded up at boot. Attempts to manually load and activate it also came to naught.

So, time to dig in and see what's going wrong... DOS 3.3 does initial checks and setup of the language card via a small machine language program, loaded into RAM at boot by the DOS master disk's HELLO BASIC program. Machine code is used for this part rather than BASIC so the probe/setup code can be placed and run from RAM outside the mapping range of the language card (the BASIC interpreter itself running from ROM addresses within this mapping range). Here is the machine language program, after being POKEd into RAM by HELLO, listed via the ROM monitor:

I interpret the actions of this code as follows:

  • The code first retrieves and stashes the contents of location $E000, within the mappable address range. A read of $C081 then maps the ROMs on the system board (see programming interface above), and the contents of $E000 are checked again. If the contents now differ, the code assumes the card, now in mapped ROM mode, was previously in mapped RAM mode, and it jumps to address $0332 to put things back the way it found them and bail out.

  • If not, the code reads from address $C083 twice, which puts the card, if present, in writable mapped RAM mode. Two different values are written to and compared back from location $D000. If a language card is not present, this location will still be mapped to ROM on the system board, and at least one of these compares will fail; if either one does, the code jumps to $0332 to bail out.

  • Otherwise, we've verified that a card is present and wasn't already in active play. The program now reads from address $C081 twice to move back to mapped ROM mode, but leaving RAM writable to be loaded by the remainder of the HELLO program after return (in this mode, reads will come from system ROMs, but writes will go to the corresponding locations in language card RAM). A success return code is loaded, and the code jumps forward to the exit code at $0334 to return to the caller.

  • The bail out routine at $0332 just sets up a fail return code, then falls through to $0334. Code here stores the return code to the return code location and does one last compare between the stashed and current values at $E000 to determine if mapped RAM mode must be restored; if so this is done via a read from location $C080 (RAM is left write-protected). Control is then returned to BASIC.
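
The steps above can be sketched in Python against a toy model of the machine. Everything here is an approximation for illustration: the soft-switch rules are the commonly documented ones for reads of $C080-$C08F (RAM is read when the low two address bits are equal; two successive reads of an odd address enable writes; an even address write-protects), bank selection is ignored, the ROM bytes and the two test values are hypothetical, and the return value is a plain boolean rather than the real return code.

```python
class ToyAppleII:
    """Tiny model of the $D000-$FFFF mapping with an optional language card."""
    def __init__(self, card_present=True):
        self.card = card_present
        self.rom = {0xD000: 0x6F, 0xE000: 0x4C}   # placeholder ROM contents
        self.ram = {}                             # card RAM (single bank)
        self.read_ram = self.write_en = self.prewrite = False

    def read(self, addr):
        if 0xC080 <= addr <= 0xC08F:
            if self.card:                          # soft switch access
                self.read_ram = (addr & 1) == ((addr >> 1) & 1)
                if addr & 1:                       # odd: count toward write enable
                    self.write_en = self.write_en or self.prewrite
                    self.prewrite = True
                else:                              # even: write-protect
                    self.prewrite = self.write_en = False
            return 0
        if self.card and self.read_ram:
            return self.ram.get(addr, 0xFF)
        return self.rom.get(addr, 0x00)

    def write(self, addr, value):
        if self.card and self.write_en and addr >= 0xD000:
            self.ram[addr] = value

def probe(m, patterns=(0xA5, 0x5A)):               # test values are hypothetical
    stashed = m.read(0xE000)
    m.read(0xC081)                                 # map system ROM
    ok = False
    if m.read(0xE000) == stashed:                  # card was not in RAM mode
        m.read(0xC083); m.read(0xC083)             # writable mapped RAM
        ok = True
        for value in patterns:
            m.write(0xD000, value)
            if m.read(0xD000) != value:
                ok = False
                break
        if ok:
            m.read(0xC081); m.read(0xC081)         # read ROM, RAM writable
    if m.read(0xE000) != stashed:                  # exit path: restore RAM mode
        m.read(0xC080)
    return ok

print(probe(ToyAppleII(card_present=True)), probe(ToyAppleII(card_present=False)))
```

A card whose mapping flip-flop never sets behaves like the no-card case in this model: the writes to $D000 never read back, so the probe reports failure without crashing.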

In any case, the above is what the code does, even if I may have misinterpreted some of its motivations. On my system, the ROM on the language card seemed to be working correctly (monitor ran fine, but if ROM on language card was pulled the machine failed to boot, indicating the monitor was in fact running successfully out of ROM on the card). However, mapped RAM mode seemed never to engage. The machine code snippet above returned "00" without crashing, and the DOS HELLO program assumed no card present. Manually accessing the appropriate control registers also had no effect.

The language card design does not seem to be documented in detail, but a schematic is available and it is not too hard to suss out. The first thing to check would seem to be the register interface. Here is an excerpt from the schematic, with a couple of annotations added:



The register interface is principally implemented by a 74LS175 quad flip-flop at D4. These flip-flops share a common reset at power-on, and a common clock based on the expansion slot DEV_L signal; DEV_L is a slot-specific signal strobed by the system on any bus cycles to IO address space allocated to that slot. Examination of the programming interface along with address line decodes feeding the D inputs here leads to the conclusion that the top-most flip-flop holds RAM/ROM mapping mode, the next one down is RAM bank select, the next is RAM write-enable, and the last is a one-bit count to support write-enable after only two successive reads.

Put a chipclip on at D4, and observed behavior of the flip-flops on the logic analyzer while exercising the control registers via the monitor. Problems were apparent right away: while the bank select and count flip-flops were responding as expected, the mapping mode and write-enable flip-flops were not. The shared reset and clock appeared correct, and all flip-flops themselves were responding logically correctly with respect to their inputs. The issue was apparently with the incoming D lines to two of the flip-flops. Starting with the ROM/RAM enable flip-flop, then, that D line is fed via C3 and C4. And a visual inspection in preparation for clipping these up for the analyzer revealed that, surprisingly, these two chips had actually been reversed on the board!

The language card is fully socketed, so presumably this reversal had occurred during previous troubleshooting or service by the former owner? In any case, it was easy to swap the chips back to their correct locations, and the logic analyzer showed that the entire register interface seemed to be behaving according to design after that. DOS 3.3 now recognized the card at boot (yay!), but the system would crash after loading and mapping the RAM. So, maybe a bad RAM chip as well?

To troubleshoot the RAM, I followed a technique described in this video on the "Adrian's Digital Basement" YouTube channel. Basically, a couple short machine language programs, entered and executed via the monitor, to copy ROM contents into the language card RAM, then copy back out to another location in system RAM. This allows a comparison of source data in ROM and round-tripped data in system RAM; a mismatch indicates a RAM problem on the card. Here are the two programs I used, and dumps of the resulting data:

And sure enough, it's quite visible here that bit 7 is having problems. Swapped out the 4116 at A2, and that seems to have done it for this one. Now I need to go find a copy of Choplifter...
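
If the two dumps are captured off the machine (say, over a serial link), the comparison can also be done mechanically rather than by eye. The following is a minimal sketch of that idea; the sample bytes are made up, and mapping a failing bit to a particular 4116 still has to come from the schematic.

```python
def bad_bit_mask(source, round_trip):
    """OR together the XOR of each byte pair: set bits are bits that differed."""
    mask = 0
    for s, r in zip(source, round_trip):
        mask |= s ^ r
    return mask

def bad_bits(mask):
    """List the bit positions (0-7) set in an 8-bit mask."""
    return [bit for bit in range(8) if (mask >> bit) & 1]

# Hypothetical dumps: ROM source data, and the same data after a round
# trip through language card RAM with bit 7 failing to hold a 1.
source     = bytes([0x80, 0x4C, 0xFF, 0x20])
round_trip = bytes([0x00, 0x4C, 0x7F, 0x20])

print(bad_bits(bad_bit_mask(source, round_trip)))
```

A mask with a single bit set points at one data bit, and therefore at one RAM chip, which is the pattern seen in the dumps above.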


Sanyo VM4209

Sun 13 March 2022 by Fritz Mueller

Some time back I purchased a Sanyo VM4209 B&W video monitor on eBay. Sold as working, it arrived with apparently working horizontal but no vertical deflection. The price had been right, however, so it was just shelved into the repair queue.



I now have a few DEC VT series terminals to work on and would like to have a small video monitor available on the bench for this, so I pulled down the VM4209 to give it a look. Given the lack of vertical deflection, the first thing to check would be the vertical deflection power transistors, Q105 and Q106 here:



It turns out these transistors are quite well buried in the guts of the monitor on a small aluminum heat sink. Took a while to work my way in to them, and required unsoldering a few leads and removing the CRT:

Sure enough, the lead connecting Q106 collector to ground had been broken, and the part tested bad. Q105 appeared healthy, and there were no other apparent signs of distress in the surrounding circuitry. Q106 is an out of production 2SB474 germanium PNP in the less common TO66 package; I opted to replace it with the "modern" equivalent NTE226. Re-assembled, and was greeted immediately with a functioning raster. Pulled out an Apple II Plus to test with:

This was quite fun for a short while, until a RIFA cap in the Apple II power supply let go, filling the dining room with acrid smoke. So, expect an additional article on that one in the near future :-)


BASIC-11 under RT11

Sun 15 August 2021 by Fritz Mueller

I figured it might be fun to play around a little bit with BASIC-11 under RT11 on the newly-restored PDP-11/34. If I got that working, it could also be included on the RK05 RT11 disk image that I use regularly for demos on the larger PDP-11/45.

The first thing to do was to find a compatible disk image and get it running under simh. Bitsavers had BASIC-11_V2.1_RX02.DSK.zip, which would seem to fit the bill, but the contained image would not mount successfully on simh's RY device. Looking through a dump of the image, there was an apparent "RT11A" signature, so that looked promising. Tried putr under dosbox as well, but it would hang mounting the image. So, off to the cctalk mailing list for some advice...

Glen Slick on the list first noticed a file size discrepancy:

That BASIC.DSK image file has a size of 486,400 bytes. I don't know where that size would come from.

A physical RX-02 floppy should have a sector size of 256 bytes, with 26 sectors per track, and 77 tracks, which would be a total of 512,512 bytes, or 505,856 bytes if the first physical track is ignored.

Indeed, the other RX-02 floppy images available here do have a size of 505,856 bytes: http://www.bitsavers.org/bits/DEC/pdp11/floppyimages/rx02/

Hmm, maybe that BASIC.DSK image file was created by something that only copied the initial allocated logical sectors and ignored unused logical sectors at the end of the floppy, and maybe PUTR doesn't handle disk image files that are not the expected full size?
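The size arithmetic in the message above is easy to check. Here is a minimal sketch in Python; the constant and function names are my own, chosen for illustration, and are not from any tool mentioned in the thread.

```python
# RX02 double-density geometry, as given in the message above.
BYTES_PER_SECTOR = 256
SECTORS_PER_TRACK = 26
TRACKS = 77

# Size of an image holding all 77 tracks.
FULL_SIZE = BYTES_PER_SECTOR * SECTORS_PER_TRACK * TRACKS
# Size of an image that leaves out the first physical track.
NO_TRACK0_SIZE = FULL_SIZE - BYTES_PER_SECTOR * SECTORS_PER_TRACK


def padding_needed(image_size, target_size=FULL_SIZE):
    """Return how many bytes must be appended to reach target_size."""
    if image_size > target_size:
        raise ValueError("image is larger than the target size")
    return target_size - image_size


print(FULL_SIZE, NO_TRACK0_SIZE, padding_needed(486_400))
# prints: 512512 505856 26112
```

The last figure is the number of padding bytes needed to bring the 486,400 byte image up to the full 512,512 bytes.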

Example of padding the 486,400 byte BASIC.DSK image file to a size of 512,512 bytes on a Windows system:

FSUTIL FILE CREATENEW BLANK 26112
COPY /B BASIC.DSK+BLANK TEST.DSK

C:\PUTR>DIR TEST.DSK
Volume in drive C has no label.
Volume Serial Number is 14CE-1A29
Directory of C:\PUTR
08/11/2021  12:55p             512,512 TEST.DSK

C:\PUTR>PUTR
PUTR V2.01  Copyright (C) 1995-2001 by John Wilson <wilson@dbit.com>.
All rights reserved.  See www.dbit.com for other DEC-related software.

COPY mode is ASCII, SET COPY BINARY to change
(C:\PUTR)>MOUNT RX: TEST.DSK /RX02 /RT11 /RONLY
(C:\PUTR)>DIR RX:

Volume in drive RX is RT11A
Directory of RX:\*.*

11-Aug-2021
BSOT0D.EAE    12  04-Apr-1983
BSOT0S.EAE    10  04-Apr-1983
BSOT1D.EAE     9  04-Apr-1983
BSOT1S.EAE     6  04-Apr-1983
BSOT0D.EIS    12  04-Apr-1983
...

...etc. Nice. Still no luck mounting under simh, though. Glen further offers:

As far as I can tell by default PUTR expects to work with logical sector order RX-02 disk images that are 512,512 bytes in size. The BASIC-11 RX-02 disk image available here is in logical sector order, but is less than 512,512 bytes in size: http://www.bitsavers.org/bits/DEC/pdp11/floppyimages/rx02/ PUTR appears to be unhappy with the disk image unless it is padded to 512,512 bytes in size.

On the other hand as far as I can tell by default SIMH expects to work with physical sector order RX-02 disk images. If I mount the logical sector order RX-02 disk image that works with PUTR in SIMH, then RT-11 gives a "?DIR-F-Invalid directory" error. If I translate the logical sector order RX-02 disk image back into a physical sector order disk image (dealing with track shifting, sector interleaving, and track to track sector skewing) then RT-11 on SIMH is happy with the disk image.

and:

One bit of information that I found helpful as a reference when I looked at this quite a while ago was the 2.11BSD RX02 floppy disk device driver source code, which can be viewed online here:

https://minnie.tuhs.org/cgi-bin/utree.pl?file=2.11BSD/sys/pdpuba/rx.c

In particular, the routine rxfactr(), which as the comment says it calculates the physical sector and physical track on the disk for a given logical sector.

I used that as a starting point to write a simple utility to read an RX-02 disk image file in logical sector order and output an RX-02 disk image file in physical sector order.

/*
*  rxfactr -- calculates the physical sector and physical
*  track on the disk for a given logical sector.
*  call:
*      rxfactr(logical_sector,&p_sector,&p_track);
*  the logical sector number (0 - 2001) is converted
*  to a physical sector number (1 - 26) and a physical
*  track number (0 - 76).
*  the logical sectors specify physical sectors that
*  are interleaved with a factor of 2. thus the sectors
*  are read in the following order for increasing
*  logical sector numbers (1,3, ... 23,25,2,4, ... 24,26)
*  There is also a 6 sector slew between tracks.
*  Logical sectors start at track 1, sector 1; go to
*  track 76 and then to track 0.  Thus, for example, unix block number
*  498 starts at track 0, sector 25 and runs thru track 0, sector 2
*  (or 6 depending on density).
*/
static
rxfactr(sectr, psectr, ptrck)
   register int sectr;
   int *psectr, *ptrck;
{
   register int p1, p2;

   p1 = sectr / 26;
   p2 = sectr % 26;
   /* 2 to 1 interleave */
   p2 = (2 * p2 + (p2 >= 13 ?  1 : 0)) % 26;
   /* 6 sector per track slew */
   *psectr = 1 + (p2 + 6 * p1) % 26;
   if (++p1 >= 77)
       p1 = 0;
   *ptrck = p1;
}
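Glen's conversion utility itself is not shown in the thread. As a rough illustration of the idea only, here is a sketch in Python that ports rxfactr() and uses it to shuffle a logical-order image into physical order. It assumes the physical-order file is a flat, zero-padded image of 77 tracks by 26 sectors in track-major order; the function name logical_to_physical is hypothetical.

```python
SECTOR_SIZE = 256          # RX02 double density
SECTORS_PER_TRACK = 26
TRACKS = 77


def rxfactr(logical_sector):
    """Python port of rxfactr(): return (physical_sector, physical_track)."""
    track, pos = divmod(logical_sector, SECTORS_PER_TRACK)
    # 2 to 1 interleave
    pos = (2 * pos + (1 if pos >= 13 else 0)) % SECTORS_PER_TRACK
    # 6 sector per track slew
    sector = 1 + (pos + 6 * track) % SECTORS_PER_TRACK
    # logical sectors start at track 1; the last logical track wraps to track 0
    track += 1
    if track >= TRACKS:
        track = 0
    return sector, track


def logical_to_physical(logical_image):
    """Copy each logical sector to its physical slot in a full-size image.

    Assumption: output is track-major (track 0 first, sectors 1..26 within
    each track). Physical sectors not covered by the input stay zero-filled,
    and a trailing partial sector in the input is ignored.
    """
    count = len(logical_image) // SECTOR_SIZE
    if count > SECTORS_PER_TRACK * TRACKS:
        raise ValueError("input holds more sectors than an RX02 disk")
    out = bytearray(SECTOR_SIZE * SECTORS_PER_TRACK * TRACKS)
    for lsn in range(count):
        sector, track = rxfactr(lsn)
        dst = (track * SECTORS_PER_TRACK + (sector - 1)) * SECTOR_SIZE
        src = lsn * SECTOR_SIZE
        out[dst:dst + SECTOR_SIZE] = logical_image[src:src + SECTOR_SIZE]
    return bytes(out)
```

Something like `open("out.dsk", "wb").write(logical_to_physical(open("BASIC.DSK", "rb").read()))` would then produce a 512,512 byte file; whether a given simulator accepts it depends on the layout assumption above.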

An RX02 image shuffled into physical sector order generated by Glen and suitable for use with simh is attached here.

Jerry Weiss further suggested that the original, logically ordered image may work as is under simh if attached as an MSCP device rather than RX02. This turns out also to be the case:

On Fri, Aug 13, 2021 at 9:46 AM Jerry Weiss wrote:
Could you attach logical sector (block?) image as MSCP disk in SIMH? Other than some minor image manipulation for removing track 0 if present, is there any reason this would not be readable?

Hmm, it didn't occur to me to try that. Mounting the logical sector order RX-02 disk image, without any modification necessary, as a raw MSCP disk does indeed appear to work!

sim> ATTACH RQ1 BASIC.DSK
RQ1: 'BASIC.DSK' Contains RT11 partitions
1 valid partition, Type: V05, Sectors On Disk: 950

sim> SHOW RQ1
RQ1     486KB, attached to BASIC.DSK, write locked
       RD54, UNIT=1, autosize
       RAW format

.DIR DU1:

BSOT0D.EAE    12  04-Apr-83      BSOT0S.EAE    10  04-Apr-83
BSOT1D.EAE     9  04-Apr-83      BSOT1S.EAE     6  04-Apr-83
BSOT0D.EIS    12  04-Apr-83      BSOT0S.EIS     9  04-Apr-83
BSOT1D.EIS     9  04-Apr-83      BSOT1S.EIS     6  04-Apr-83
BSOT0S.FIS     7  04-Apr-83      BSOT1S.FIS     6  04-Apr-83
...

...etc. Armed with the above, I was able to get BASIC-11 into an RT11 image on the Unibone card, and running on the new PDP-11/34. Here's output from the DEC BASIC Mandelbrot program at Rosetta Code:


Unibone

Wed 24 March 2021 by Fritz Mueller

I have been keeping an eye on Jörg Hoppe's interesting Unibone project for some time -- it is a general-purpose Unibus device emulator and diagnostic tool, built around a BeagleBone Black compute module running embedded real-time Linux. The PDP-11/34 restoration project finally provided enough impetus for me to pull the trigger on getting one.

Sent Jörg an email to order a kit, which arrived some weeks later complete with bundled BeagleBone. The kit is pretty well thought-out and was enjoyable to put together. Had to throw in a few of my own pin headers and jumpers to complete the assembly. The only other small confusions were a few of the resistor packs which did not match the schematic (Jörg informed me these are non-critical values.)

The kit did not include card handles. I decided to try having some 3D printed by Shapeways, using their "processed versatile plastic" process, which is a laser sintered nylon, color dyed and smoothed. I used a card handle model by Vince Slyngstad found here. The results were nice, sturdy, and dimensionally correct. The chosen "purple" color is a rather intense magenta in real life. Not exactly cheap for just a couple parts, but I had been wanting to try their print service.

The Unibone has all sorts of capabilities, and proved itself very useful during the '11/34 restoration:

  • Ability to bus master to probe the Unibus address space and run diagnostics on memory found there. This was very useful for debugging the memory card that came with the -11/34 and sussing out its undocumented configuration switch settings.

  • Ability to directly load and execute MAINDEC diagnostics, without needing a functioning console emulator or storage subsystem. This is a convenient and speedier alternative to PDP11GUI.

  • Subsequently, the ability to emulate entire storage subsystems, very useful for loading and running full operating systems on this -11/34 which otherwise has no storage of its own.

The Unibone goes in a quad SPC slot; I opted for slot 9 on the -11/34, and this entailed removing the NPG jumper on the backplane there to allow the Unibone to bus master. The device worked well straight-away after assembly.

There are, alas, a couple small frustrations with the current design:

  • It is desirable to configure the Unibone and backplane to allow the Unibone to bus master and interrupt. However, this leaves grant chain(s) open at boot until the Unibone's own embedded software can boot and take control of the card (which takes on the order of a minute or so). During this time the host system is non-functional or may hang, and it needs to be subsequently reset (this reset can be scripted from the Unibone, but all of this does significantly increase end-to-end boot time of the machine). It would be nice if the Unibone had something like some normally-closed relays on the grant chains, to preserve grant continuity until control is actively assumed.

  • It would be desirable to be able to power the embedded BeagleBone in isolation, in place in a system, without having to have the entire host system powered at the same time (e.g. for maintenance of the Unibone's embedded software stack, maintenance of locally stored storage system media images, etc.) There is a relay on the Unibone which switches in Unibus power when available, but unfortunately, the design is such that if the BeagleBone is also externally powered the relay remains engaged when the host system is shut down. This could lead to the BeagleBone trying to power the entire Unibus via its 5V supply/connector, which could obviously be problematic... For now it seems best just to pull the card in order to run it in isolation, which is a little less than convenient.

That said, the designs and software are open source, and the card comes with some generous prototyping areas built right in, so some mods to address these issues could be a fun project. All in all, Jörg has put together a fantastically useful bit of kit, and I'm certainly glad to have it in my toolbox!


PDP-11/34

Tue 09 March 2021 by Fritz Mueller

This spring I worked on repair/restoration of a friend's PDP-11/34. The system was in fairly good shape, but missing a few bits and pieces and with the usual sorts of issues for 45-year-old kit. Started per usual with disassembly, cleaning, and inspection. The BA11-K chassis was in pretty decent shape; just a few scratches requiring some sanding and a little touch-up paint to inhibit future corrosion.

Date codes on the chassis and CPU cards are from 1976, but other components in the chassis are a bit of mix-and-match (a KY11-LB console interface and a third-party Monolithic systems memory board date from 1981, and a DL11-W SLU/RTC card is from 1977). Serial number is 2001. There is also a sticker for "OHIO NUCLEAR", which was an early manufacturer of CT devices.

Foam problems here were limited to a decayed air pre-filter at the front of the chassis and some padding on the cable retaining bar at the rear. A heat gun and a paint scraper are your friend for removing the leftover cellophane adhesive strips that were used to secure the foam. For the replacement pre-filter, I opted for 3M Filtrete carbon pre-filter sheet (part FAPF-UCTF2PAMZ) which comes in sheets large enough to cover the front of the chassis and is easily cut to size with scissors.

The front panel brackets ended up being a bit of a puzzle to reassemble -- I unfortunately failed to pay close attention to how exactly the lower fasteners were configured during disassembly. Most of the wisdom out in the restoration community seems to pertain to a newer, and much more convenient, version of these brackets (or the ones that arrived on this system were mismatched?) Here's a picture of the brackets that I have, and a shot of the arrangement I finally settled on for the flange-blinded mounting hole at the bottom of the chassis: machine screws driven from the back of the bracket with Keps nuts toward the front. I also added some 1/8" nylon spacers so the pre-filter could be extended across the entire front of the chassis, behind the brackets, and everything still remains square when tightened up. A serviceable replacement power knob was tracked down here.

The BA11-K chassis has an integrated H765 power supply. The power-controller unit was in pretty good shape, but I replaced the line cord since the old one had some fairly serious nicks in its outer jacket. Also replaced cap C1 (50uF) which seemed to be drifting off value. Replaced the .1uF across-the-line caps mounted on the power transformer with modern X2 safety caps. The DC regulator modules (2x H744 +5V and 1x H745 -15V) were disassembled and cleaned. Reformed all the large electrolytics, then load tested the reassembled regulators individually. Nothing out of sorts here except the usual replacement of burnt out incandescent indicator bulbs.

I filled out the system with a near-side M9301 bootstrap-terminator (recent eBay purchase), some G727 "knuckle buster" grant continuity cards, and an M9302 SACK turnaround far-side terminator. New on this restoration was a UniBone Linux-to-Unibus bridge, used to emulate storage devices among other things (more on this in a separate article soon). Checked/adjusted NPG continuity on the backplane (continuity wire wraps in place for all slots except slot 9, to accommodate the UniBone). Module utilization as follows:

Slot  Modules (card positions A-F)
1     M7266 CPU control
2     M7265 CPU data path
3     M9301 boot term | M7859 console
4     Monolithic Systems 303-0158 64 KiB memory
5     M7856 serial / line clock
6     G727
7     G727
8     G727
9     M9302 SACK term | UniBone

Connected up a VT100 to the serial card, and fired things up. Good signs of life from the front panel, but the machine immediately halted without producing a boot monitor prompt. Was able to reset the machine from the front panel, though, and then examine and deposit various memory locations from there.

Boot ROM memory locations were readable, and the contents looked correct. RAM addresses were generally readable and writable, but bit 10 appeared incorrect (sometimes always set; sometimes always clear). I was also able to successfully write to the console XBUF register from the front panel and see characters appear on the VT100.

A bus init from the front panel followed by manually punching in the boot ROM entry point produced a functional ROM monitor on the VT100. Deposits and examines to RAM done from the boot monitor produced results identical to those seen using the front panel (same bit 10 problem).

One of the cool features of the KY11-LB console is a maintenance mode that can run Unibus bus cycles on its own without a CPU. This gives a way to do limited testing of cards in isolation: just set up the M7859 on a powered, terminated backplane segment and plug in cards to be tested one at a time. Deposits and examines can then be done using the buttons and display on the front panel.

Interestingly, when running this way with just the console and memory cards in place the bit 10 errors were no longer apparent. Some other card was apparently corrupting bit 10 on the bus; by checking one at a time the problem was quickly isolated to the M9301 boot terminator card.

The M9301 drives the implicated bit onto the Unibus via an 8881 bus driver at position E9, as seen below. The signal coming in from the bottom here is ENAB DATA H, which is meant to enable these drivers only when the M9301 detects a valid address decode. Verified that data was being incorrectly driven on BUS D10 L at E9 pin 13, regardless of the state of pin 12, indicating a faulty driver. Pulled this, socketed, and replaced (with a compatible ECG 7439), and the bit 10 problem was fixed.



There was still some problem with auto-boot to the M9301 monitor, however; the monitor prompt would now begin to print at power up, but the machine would halt a few characters in. The front panel bus init plus manual jump to monitor entry point workaround was still working though, so put off further investigation of this issue until later.

At this point, given the workaround, the system was working well enough to begin loading and running MAINDEC diagnostics over the serial line with PDP11GUI. Relevant diagnostics, from the PDP-11/34 System User's Manual:



DFKAA, DFKAB, and DFKAC all ran without issue. DFKTG, DZKMA, and DZQMC all reported various errors, so time to look into the memory board.

The board is a Monolithic Systems 303-0158:

I could not find any information on the board on the internet, but much can be figured out by inspection and testing. First, the board is using 4116 (16Kx1) DRAMs, pretty usual for the era. There is space for 4 banks of 18; each bank would be 16K words (16 data bits plus two parity bits per word.) Here we see two banks populated, minus one of the parity chips. So we'd expect to see 32K words (64 KiB) mappable (or 28K words [56 KiB] with address translation disabled, to accommodate the 4K word [8 KiB] I/O page.) The missing parity chip is unlikely to cause any trouble in this application; in an '11/34, there is no memory parity support without the optional M7850 parity board installed, and this system does not have one.

One of the capabilities of the Unibone is to probe the full 18-bit Unibus address space, looking for active pages. These tests indicated that the memory board as configured was responding to the lower 128 KiB of addresses, even though only 64 KiB was populated. One would suppose that the mapped address range was configured via the DIP switches on the board. Some experimentation with various switch settings yielded the following:

SW1: Memory start addr, 000000 + values as follows
  7 6 5 4 3
  0 . . . .   400000
  . 0 . . .   200000
  . . 0 . .   100000
  . . . 0 .   040000
  . . . . 0   020000

SW2: Memory size, 020000 + values as follows
  4 3 2 1
  0 . . .   200000
  . 0 . .   100000
  . . 0 .   040000
  . . . 0   020000

After setting the switches appropriately for the amount of memory physically present, memory test errors went away and the MAINDEC memory diagnostics (excepting parity tests) also ran successfully.
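The switch table above can be turned into a small calculator. This is only a sketch of my reading of the table: each switch position shown as "0" adds its octal byte-address weight. What "0" corresponds to physically (on/off, open/closed) is not something I've pinned down here, and the function names are made up for illustration.

```python
# Octal byte-address weights read off the table above.
SW1_WEIGHTS = {7: 0o400000, 6: 0o200000, 5: 0o100000, 4: 0o040000, 3: 0o020000}
SW2_WEIGHTS = {4: 0o200000, 3: 0o100000, 2: 0o040000, 1: 0o020000}
SIZE_BASE = 0o020000  # the table's "020000 +" term for SW2


def start_address(sw1_zero):
    """Start address for the SW1 positions that are in the '0' state."""
    return sum(SW1_WEIGHTS[n] for n in sw1_zero)


def memory_size(sw2_zero):
    """Mapped size in bytes for the SW2 positions that are in the '0' state."""
    return SIZE_BASE + sum(SW2_WEIGHTS[n] for n in sw2_zero)


# All four SW2 positions at '0' maps 128 KiB, matching what the probe saw.
print(f"{memory_size({4, 3, 2, 1}):06o}")   # prints: 400000
# One combination consistent with the 64 KiB actually populated.
print(f"{memory_size({3, 2, 1}):06o}")      # prints: 200000
print(f"{start_address(set()):06o}")        # prints: 000000
```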

So the last thing to fix was the problem with the boot monitor at startup. For this, the boot ROM card went back out on an extender so I could get at it with a logic analyzer.

A PDP-11 generates power down and power up traps, through location 024, based on transitions of the AC LO and DC LO Unibus signals. In handling this trap, the processor first reads the PC from location 024, then the PSW from location 026. Many PDP-11s had core memory or battery-backed RAM; this allowed for orderly recovery from power failure events.

PDP-11 boot ROM cards like the M9301 or its younger cousin the M9312 use a hack to obtain control at boot. They monitor AC LO and DC LO, and when detecting a boot condition they jam higher order address bits on the Unibus for the first couple of bus cycles. This causes the PC and PSW to be fetched from locations within the address space of the boot ROM card. Here is most of the circuitry responsible for this:

The bus drivers that jam the address bus are seen on the right. The central player here is E21, a 9602 one-shot. CLEAR ADDR L is supposed to arrive after the first two bus cycles (fetch of PC and PSW) and release the bus; the one-shot is set up to time out after about 300ms and release the bus in any case.

On the logic analyzer, we can see an issue here:

MSYN delimits bus cycles mastered by the CPU. Here we can see that CLEAR ADDR L never arrives, and so the higher-order address bits remain jammed by the M9301 for the full duration of the one-shot timeout. This is okay for the first few instructions, which are executing out of the ROM anyway, but things quickly go awry...

Here is the circuitry responsible for CLEAR ADDR L:

The desired pulse is mediated by 270 uF capacitor C36 in one leg of gate E20, so this is a good thing to check first, and... it is actually missing from the board! (Visible in the M9301 gallery picture above.) Replaced this cap, and now we are in good shape:

With this, the machine is fully repaired. Spent a little time with it, booting and running various operating systems from emulated storage on the Unibone card. Frieda also approves:


I recently had need to assess and repair several DL11 serial interfaces in my stock of spares. One of these had had some sort of end-user hack applied; in the course of putting the board back to factory condition, I did some analysis of the hack and its intended purpose, documented here.

Easy enough to beep this out and reverse to a schematic:

So, the hack appears to dynamically alter the CSR address and interrupt vector of the card, choosing between two hard-wired presets, based on whether P1A/P1B are connected together or not.

The CSR jumpers on a stock DL11 operate with pull-ups upstream of the address decode logic, so these can be directly driven by the hack so long as the jumpers for the bits-to-be-hacked are left open on the board. The vector address bits, however, must be driven by the DL11 onto the Unibus contingent on an appropriate global enable. On a stock DL11, drivers for all configurable vector bits are activated by a single global enable, and jumpers downstream of the drivers control which of these activated bits will be admitted to the bus. So, for the vector address part of the hack to function, hack control must be asserted instead of the global enable for each of the to-be-driven bits, and the corresponding jumpers for these bits must be left in. And indeed, upon inspection of the DL11 there are trace cuts that have been done (marked here with "X") to lift the global enable and allow individual hack control of each of the affected bits:


Last, we can look at the board jumpering and the wiring of the hack to determine the specific CSR and vector addresses at play:

             A11 A10 A9   A8  A7  A6   A5  A4  A3   A2  A1  A0    CSR
P1 Open       1   1   0    1   0   1    0   0   1    0   0   0    776510
P1 Closed     1   1   1    1   0   1    1   1   0    0   0   0    777560

             V8  V7  V6   V5  V4  V3   V2  V1  V0    Vector
P1 Open       0   1   1    0   0   1    0   0   0    310
P1 Closed     0   0   0    1   1   0    0   0   0    060


We see from these specific addresses that closing the contacts of P1 would dynamically re-jumper the board from assignment as the 2nd non-console interface to assignment as the console interface. So perhaps this was once used (in conjunction with another similarly hacked interface?) to swap console terminals with the flip of a single switch.
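For anyone wanting to double-check the bit patterns against the octal addresses, here is a small sketch in Python. The helper names are mine, and the CSR calculation assumes the upper six bits of the 18-bit address are all ones (i.e. the register sits in the I/O page), which is consistent with the addresses in the table.

```python
def bits_to_int(bits):
    """Pack a most-significant-first sequence of 0/1 values into an integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def csr_address(a11_to_a0):
    """18-bit CSR address, assuming address bits 17..12 are all ones."""
    return 0o770000 | bits_to_int(a11_to_a0)


def vector_address(v8_to_v0):
    """Interrupt vector built from bits V8..V0."""
    return bits_to_int(v8_to_v0)


open_csr = csr_address([1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0])
closed_csr = csr_address([1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0])
open_vec = vector_address([0, 1, 1, 0, 0, 1, 0, 0, 0])
closed_vec = vector_address([0, 0, 0, 1, 1, 0, 0, 0, 0])

print(f"{open_csr:06o} {open_vec:03o}")      # prints: 776510 310
print(f"{closed_csr:06o} {closed_vec:03o}")  # prints: 777560 060
```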


[A catch-up article, documenting events of April/May 2020]

In late April, I offered to give a video demonstration of the '11/45 to some interested work colleagues. Since I hadn't had it on in a while, I fired it up to make sure everything was still in working order. The machine behaved well from the front panel and was able to boot both V6 Unix and RSTS V06C. Great! Typed a very simple demo program in to RSTS (print a multiplication table) and that ran, but produced some very strange results. Uh oh...

Asked RSTS to PRINT PI, and it spat out a value somewhere around 3.7... :-)

So, time to try the floating point MAINDECS... Sure enough, failures all over the place, starting with the very first diagnostic in the floating point suite, CFPAB0. This diagnostic covers utility operations like LDFPS/STFPS, SETI/SETL, SETF/SETD, etc.

I do not have listings for the diagnostics in this suite, but it is usually simple enough to reproduce failures with short toggle-in programs given the names and descriptions of the failing diagnostics. In this case, the following simple code to exercise an LDFPS/STFPS sequence from the front panel switches and lights showed that bits 10 and 11 of the floating point status/control word would come back erroneously toggled:

001000  170137  START:  LDFPS   @#177570        ;LOAD FPS FROM SWITCH REGISTER
        177570
001004  170237          STFPS   @#177570        ;AND STORE BACK TO DISPLAY REGISTER
        177570
001010  000773          BR      START           ;REPEAT
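When comparing what was set on the switches with what comes back on the lights, it helps to reduce the two octal values to a list of bit numbers. A tiny helper along these lines does the job; the function name and the example values are hypothetical, not readings taken from the machine.

```python
def differing_bits(written, read_back, width=16):
    """Return the bit positions (0 = LSB) where two words disagree."""
    diff = (written ^ read_back) & ((1 << width) - 1)
    return [bit for bit in range(width) if diff & (1 << bit)]


# Hypothetical example: all zeros written, bits 10 and 11 come back set.
print(differing_bits(0o000000, 0o006000))   # prints: [10, 11]
```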

First things first, check power to the FPU and its clock; these look fine. Next, plug the KM11 into the floating point slot and check the FPU microcode sequences while executing LDFPS and STFPS instructions. These also look fine:

  • For LDFPS @#177570 I see RDY.00, RDY.10, RDY.20, RDY.30, RDY.70, LD.50

  • For STFPS @#177570 I see RDY.00, RDY.10, RDY.20, RDY.30, RDY.80, STR.30, STR.08

Most of the data paths of interest regarding the FPS register are on the fraction low (FRL) board, so this goes out on extenders so the microcode can be stepped and gate-level logic inspected with a logic probe.

Here is the block diagram of data paths in the FPU, for reference in discussion below:

FP11-B data paths

So, one thing to note with regard to the FPS register is that it is gated through the ACMX multiplexer and written into scratch pad register AC7[0] during microcode state RDY.00 which is the first state in the common prolog of every FPU instruction:

FP11-B microcode prolog

Stopping in state RDY.00 and examining the ACMX inputs, selects, and outputs for bits 10 and 11 immediately reveals a problem. These bits of ACMX are implemented by a 74153 dual 4-input mux, E71 on sheet FRLB of the FP11-B engineering drawings:

FP11-B ACMX <11:10>

Inputs from the FPS register on pins 6 and 10 appear correct, as do the selector signals on pins 14 and 2. But outputs on pins 7 and 9 appear to be inverted. So E71 appears bad. Pulled this, socketed, and replaced. After this fix, LDFPS/STFPS function correctly in the toggle-in test program, and MAINDEC CFPAB0 passes.
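To make the "outputs inverted" observation concrete, here is a toy model of one half of a 74153 in Python. It follows the standard 74153 behavior (two shared select lines, four data inputs per half, active-low strobe); the helper names and the logic levels in the example are illustrative assumptions, not probe readings.

```python
def mux_74153_half(inputs, a, b, strobe_n=0):
    """One half of a 74153: inputs is (C0, C1, C2, C3), a is the LSB select.

    With the active-low strobe high the output is forced low.
    """
    if strobe_n:
        return 0
    return inputs[(b << 1) | a]


def check_output(inputs, a, b, observed):
    """Compare an observed output level against what a healthy part gives."""
    expected = mux_74153_half(inputs, a, b)
    return "ok" if observed == expected else "mismatch"


# Example: selects both low pick C0; a high on C0 should give a high output.
print(mux_74153_half((1, 0, 0, 0), a=0, b=0))             # prints: 1
print(check_output((1, 0, 0, 0), a=0, b=0, observed=0))   # prints: mismatch
```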

Not out of the woods yet, though... Progressing down the sequence of MAINDECS, diagnostic CFPDC0 (add/subtract) now fails :-( For this, we bring back the simple "add two floats" diagnostic used during previous FP11 debug:

        000000                          AC0=%0
        000001                          AC1=%1
000000                                  .ASECT
        001000                          .=1000
001000  170011                  START:  SETD                ;SET DOUBLE PRECISION MODE
001002  172467  000014                  LDD     D1,AC0      ;FETCH FIRST ADDEND FROM D1
001006  172567  000020                  LDD     D2,AC1      ;FETCH SECOND ADDEND FROM D2
001012  172100                          ADDD    AC0,AC1     ;ADD THEM (RESULT IN AC1)
001014  174167  000022                  STD     AC1,D3      ;STORE RESULT TO D3
001020  000000                          HALT
001022  040200  000000  000000  D1:     .WORD   040000,000000,000000,000000 ;0.5
001030  000000
001032  040200  000000  000000  D2:     .WORD   040000,000000,000000,000000 ;0.5
001040  000000
001042  000000  000000  000000  D3:     .WORD   000000,000000,000000,000000
001050  000000
        001000                          .END    START

Sure enough, this is producing incorrect results. The microcode flows for add/subtract/compare are a bit more involved than the simple load/store sequences above. The sequence starts with common prolog RDY.00, RDY.10, RDY.20, RDY.30, same as above. The first fork after RDY.30 goes to RDY.60, since add/subtract/compare are "no memory class" instructions (FP accumulator register operands only). The second fork after RDY.60 takes us to ADD.00 on sheet FP11 FLOWS 8.

The left side of FLOWS 8 is a decision tree for zero operands and/or whether or not we are executing a compare instruction. Traversal of these states sets up fraction and exponent operands and, if necessary, a comparison of operand exponents in the EALU. In our case (addition of two double-precision non-zero operands), the sequence is: ADD.00, ADD.04, ADD.06, ADD.02, ADD.08, ADD.12.

We then end up at state ADD.22 at the top of the right side of FLOWS 8. The previously set up exponent difference is used to index into a 256x4 "range ROM"; output bits from this ROM inform the subsequent microcode fork which determines which operand shift, if any, to apply before the upcoming fraction ALU operation.

Here a problem is evident. We should fork to ADD.24, for equal exponents, but instead we end up at ADD.30, for destination exponent less than source exponent. Putting the FXP board out on the extender and pausing in this state, the operands and operation codes on the EALU bit-slices appear to be correct, but signal FRMH ALU CIN L is erroneously asserted at E34 pin 7 (sheet FXPA). This extra carry (borrow, really, since the operation is a subtract) into the least significant bit-slice causes the EALU output to be -1 instead of 0.
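The effect of the stray borrow is easy to see in a toy model. The sketch below is a deliberate simplification: it is plain two's-complement arithmetic in Python rather than a model of the actual bit-slice ALUs, the 8-bit width is an arbitrary choice for illustration, and the exponent value used is just an example of two equal exponents.

```python
def ealu_subtract(a, b, borrow_in=0, width=8):
    """Compute a - b - borrow_in and return it as a signed value."""
    mask = (1 << width) - 1
    raw = (a - b - borrow_in) & mask
    # Reinterpret the top bit as a sign bit (two's complement).
    if raw & (1 << (width - 1)):
        return raw - (1 << width)
    return raw


exp = 0o200  # example: both operands carry the same exponent
print(ealu_subtract(exp, exp))                # prints: 0
print(ealu_subtract(exp, exp, borrow_in=1))   # prints: -1
```

With a difference of -1 rather than 0, anything downstream that keys off the exponent difference sees "destination less than source" instead of "equal".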

Moving back to the source of this signal on the FRM board, it turns out that FRM E20, a 74H40 dual quad-input NAND, is outputting an invalid logic level at pin 8. Pulled this, socketed, replaced, and the problem appears to be fixed.

After this second repair, the full suite of FP11-B diagnostics is passing again. And RSTS/E has a much less fanciful interpretation of PI...


[A catch-up article, documenting discoveries of Feb 2019]

In early 2019, I made a V6 Unix pack from the Ken Wellsch tape image, as mentioned in this blog entry. It booted on my machine, but dumped core on the first ls in single-user mode, or as soon as I did any heavy lifting in multi-user mode.

The following is the conclusion of a chronology of the troubleshooting campaign that took place over the next month and a half, culminating in a hardware fix and successful operation of V6 Unix on the machine (part I is here.) This was largely a collaborative effort between Noel Chiappa and myself via direct email correspondence, though some help was received from others via the cctalk mailing list as well.

By this point, the nature of the ls problem had been fairly well characterized: part of the ls process address space ended up holding an incorrect portion of its program text; subsequently, when execution jumped to some of these unexpected bits, an out-of-bounds memory access would occur triggering a memory management trap. Efforts now focus on understanding how and why the bad bits got there...

February 7

[Here and below, block-quoted content is excerpted from email correspondence.]

Fritz:

Noel, is it possible for you to deduce where Unix should be placing these "bad" bits (from file offset octal 4220)? Maybe a comparison of addresses where the bits should be, with addresses where the "bad" copy ends up, could point us at some particular failure modes to check in the KT11, CPU, or RK11...

Noel:

Yes, it's quite simple: just add the virtual address in the code to the physical address of the bottom of the text segment (given in UISA0). The VA is actually 04200, though: the 04220 includes 020 to hold the a.out header at the start of the command file.

So, with UISA0 containing 01614, that gives us PA:161400 + 04200 = PA:165600, I think. And it wound up at PA:171600 - off by 04000 (higher) - which is obviously an interesting number.
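The address arithmetic here is all octal, so it is worth spelling out. A short sketch in Python, using only the numbers quoted above; the one thing it adds is the usual reading of a KT11 page address register as a base in 64-byte (32-word) units, which is what turns 01614 into PA:161400. The variable and function names are mine.

```python
def par_to_physical(par):
    """Physical base address for a page address register value.

    Assumes the register counts 64-byte (32-word) blocks.
    """
    return par << 6


A_OUT_HEADER = 0o20                    # bytes of a.out header in the file
file_offset = 0o4220                   # where the "bad" bits live in the file
virtual_addr = file_offset - A_OUT_HEADER

expected_pa = par_to_physical(0o1614) + virtual_addr
observed_pa = 0o171600

print(f"VA {virtual_addr:o}, expected PA {expected_pa:o}, "
      f"off by {observed_pa - expected_pa:o}")
# prints: VA 4200, expected PA 165600, off by 4000
```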


Here's where it gets 'interesting'.

Executing a command with pure text on V6 is a very complicated process. The shells fork()s a copy of itself, and does an exec() system call to overlay the entire memory in the new process with a copy of the command (which sounds fairly simple, at a high level) - but the code path to do the exec() with a pure text is incredibly hairy, in detail. In particular, for a variety of reasons, the memory of the process can get swapped in and out several times during that. I apparently used to understand how this all worked, see this message:

https://minnie.tuhs.org/pipermail/tuhs/2018-February/014299.html

but it's so complicated it's going to take a while to really comprehend it again. (The little grey cells are aging too, sigh...)

The interesting point is that when V6 first copies the text in from the file holding the command (using readi(), Lions 6221 for anyone who's masochistic enough to try and actually follow this :-), it reads it in starting from the bottom, one disk block at a time (since in V6, files are not stored contiguously).

So, if it starts from the bottom, and copies the wrong thing from low in the file up to VA:010200, when it later gets to VA:010200 in the file contents, that should over-write the stuff that got put there in the wrong place earlier. Unless there's another problem which causes that later write to also go somewhere wrong...

So, I'm not sure when this trashage is happening, but because of the above, my guess is that it's in one of the two swap operations on the text (out, and then back in). (Although it might be interesting to look at PA:165600 and see what's actually there.) Unix does swapping of pure texts in a single, multi-block transfer (although not always as an integral number of blocks, as we found out the hard way with the QSIC :-).

So my suspicions have now switched back to the RK11... One way to proceed would be to stop the system after the pure text is first read in (say around Lions 4465), and look to see what the text looks like in main memory at that point. (This will require looking at KT11 registers to see where it's holding the text segment, first.)

If that all looks good, we'll have to figure out how to stop the system after the pure text is read back in (which does not happen in exec(), it's done by the normal system operation to swap in the text and data of a process which is ready to run).

We could also stop the system after the text is swapped out, and key in a short (~ a dozen words) program to read the text back in from the swap device, and examine it - although we'd have to grub around in the system a bit to figure out where it got written to. (It might be just easier to stop it at, say, Lions 5196 and look at the arguments on the kernel stack.)

Fritz:

...it might be interesting to look at PA:165600 and see what's actually there

A sea of zeros, as it turns out.


The most valuable thing ... would be to look at the text segment, after it's read in and before it's swapped out. I can work out where to put a halt, if you want to try that.

Yes, this sounds like a good plan to me! Is this as simple as dropping a HALT at VA:0 in the text?

Noel:

No; actually, probably easier! :-) Probably easiest is to, just before you type 'ls', put a HALT in the OS just after 4467 in Lions. Halt the machine momentarily, patch the kernel, and CONT. (Basically the same as your patch to the trap vector, just a different address.) That'll be at 021320 (should contain 062706), physical or virtual. :-)

When the system halts, you'll need to look at the text in memory. Two ways to find the location: look on the kernel stack, the address should be the second thing down:

mov 16(r3),-(sp)
add $20,(sp)
mov (r4),-(sp)
jsr pc,*$_swap

(i.e. the thing that 020 got added to). Probably easier, though, is just to look in UISA0 (which at this point is pointing to the block of memory that's been allocated to read the text into, Lions 4459-60).

That number in UISA0, T, will be the click address of the text. So PA:T00 should be the start of the text (170011 010600, etc). So then PA:(T00+010200) should be the trashed chunk of text: 110024 010400 000167 000016 010500 etc (right) or 016162 004767 000224 000414 016700 (wrong).
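The click arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper name is made up, and the sample value 01614 for UISA0 is the one quoted earlier in the thread, which may not be what the register holds at this particular halt.

```python
CLICK_BYTES = 0o100  # one memory-management "click" is 64 (0100) bytes

def click_to_pa(click):
    """Physical byte address of the start of a click (the "T" -> "T00" step)."""
    return click * CLICK_BYTES

# Assumed sample value: UISA0 = 01614, as quoted earlier in the thread.
text_base = click_to_pa(0o1614)
suspect_chunk = text_base + 0o10200  # offset of the chunk that gets trashed

print(f"text starts at PA:{text_base:o}")
print(f"suspect chunk at PA:{suspect_chunk:o}")
```

With that sample value the text starts at PA:161400 and the suspect chunk sits at PA:171600.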

February 8

Noel:

In addition to the info I already sent about how to [set the breakpoint], if you could note down the top 3 words on the kernel stack, and the contents of the RK registers, those would be really useful; the first will allow us to work out what should be in the RK registers after the swap I/O operation completes - I don't think the RK11 will be asked to do anything after that finishes and before the system hits that halt in xalloc().

To find the kernel stack.... read out KISA6, S. This value will point to the 'user' area of that process, plus the kernel stack. The kernel SP should be something like 01417xx; subtract 140000 (the segment number), and add what's left to S00. Alternatively, you can probably use the rotating switch on the front panel to just look up VA:1417xx (whatever's in R6) directly.
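As a rough sketch of that translation (assuming nothing more than "subtract 140000, add what's left to S00"), the following Python does the sum. The function name is hypothetical, and the KISA6 value 01634 is a placeholder chosen to be consistent with the KSP/PA pair reported later in this post; it is not a value read off the machine in the text.

```python
SEG6_BASE = 0o140000  # virtual base of kernel segment 6 (user area + kernel stack)
SEG6_TOP = 0o157777

def kernel_seg6_to_pa(kisa6, va):
    """Translate a segment-6 kernel virtual address to a physical address."""
    if not SEG6_BASE <= va <= SEG6_TOP:
        raise ValueError("address is not in kernel segment 6")
    # KISA6 holds the segment base in clicks; S00 is S shifted left 6 bits.
    return (kisa6 << 6) + (va - SEG6_BASE)

# Placeholder KISA6 value (an assumption, see the note above).
S = 0o1634
print(f"KSP 141656 -> PA:{kernel_seg6_to_pa(S, 0o141656):o}")
```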

Oh, if you need some bed-time reading to put you to sleep, check out the bottom section ("exec() and pure-text images") in:

http://gunkies.org/wiki/Unix_V6_internals

which will explain what's going on here with the swapping in and out, which is sorta complicated.

February 9

Noel:

just halt the machine after the text is swapped in

The code we need is at Lions 2034, where the pure text of a process is swapped in (and this should only be traversed once; I don't think the system will need to swap in the text of the shell); just put a HALT in (in the usual manner, just before trying 'ls') at 015406, which should contain a 062706 (again).

At that point, since the text size is 010400, and the location of the text in physical memory is 0161400, the BAR should contain 0172000. If not, and it's 0232000 (note that the 0200000 bit will be in the CSR, the lower XM bit) instead, Bazinga!, it's nailed (unless the system somehow snuck another RK operation in there, but I don't see anything that could do that).
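The expected register values can be worked out with a short Python sketch. It assumes, as the message above implies, that the bus address register ends up pointing just past the last byte transferred, and that an 18-bit address is split into a 16-bit RKBA plus two extension bits held in the CSR. The helper name is made up for illustration.

```python
def split_bus_address(pa):
    """Split an 18-bit bus address into (16-bit RKBA, 2 extended-memory bits)."""
    if not 0 <= pa < (1 << 18):
        raise ValueError("not an 18-bit address")
    return pa & 0o177777, pa >> 16

text_pa = 0o161400    # location of the text in physical memory
text_size = 0o10400   # text size in bytes
end_pa = text_pa + text_size

for label, pa in (("expected", end_pa), ("suspect", 0o232000)):
    rkba, ex_mem = split_bus_address(pa)
    print(f"{label}: PA {pa:o} -> RKBA {rkba:06o}, extension bits {ex_mem:02b}")
```

The expected case leaves the extension bits at 00; the suspect value 0232000 would show up as RKBA 032000 with the lower extension bit set.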

I finally get some time back in front of the machine, after a few days in bed with a cold:

...put a HALT in the OS just after 4467 in Lions. Halt the machine momentarily, patch the kernel, and CONT. (Basically the same as your patch to the trap vector, just a different address.) That'll be at 021320 (should contain 062706)...

But alas, it does not. [PA:021320] = 010246. Furthermore, [PA:015406] = 016504.


I just tried under SIMH, also, and got consistent results:

[PA:015406] = 016504
[PA:021320] = 010246

...so, one would think, my rkunix and yours are different?

Noel:

That must be it. I thought we were both working from the V6 distribution? Oh, yours prints out that Western Electric copyright notice, I don't think mine has that...

February 10

The first part of the day is spent sorting out and comparing the "Wellsch" V6 distribution that I have been using, and the "Ritchie" version that Noel has been using. Noel comes to the conclusion that the only differences in the kernel sources are in fact the four printfs for the copyright notice, but this is enough to perturb the locations of various symbols of interest between the two kernels. He also finds the binaries ls, cc, as, as2, ld, c0, c1, and c2 all match; as do liba.a, libc.a and crt0.o.

Getting back on the trail of the bug:

So the first place I'd like to try HALTing is just after the call to swap, Lions 4467; at that point, the text should be in main memory, and also just written to disk. Should be at 021320 (old contents should be 062706).

Fun things to do here: look at the text in main memory (0161400 and up), see if it's correct at this point. Also: pull the arguments off the top of the stack, and write a small program to read it back in...

This turns out to be one last typo ("rkunix" vs. "rrkunix" on Noel's part) resulting in incorrect symbol addresses for my kernel, but I'm hip to Noel's curveballs now so:

Okay, using today's newly acquired 'db' skillz :-), in my rkunix, that spot is at PA:21420. Firing up the machine again and trying that now...

It works; I end up stopped at the breakpoint and start extracting data:

Hmmm:

PA:161400: 141644 141660 000000 000000 000000 000000 000000 000000
PA:161420: 000000 000000 000000 000000 000000 000000 000000 000000

Noel:

The text is probably at a different location in PA at this point. Read out UISA0 for its base.

Fritz:

UISA0: 001654

PA:165400: 170011 010600 011046 005720 010066 000002 004767 000010

KSP: 141656 -> PA:165256

PA:165256: 007656 001654 000104 000000 101602 066312 000000 141726
PA:175600: 110024 010400 000167 000016 010500 010605 101446 010346

So far so good -- both beginning and eventually-bogus sections of text check out at this point!

Noel:

Woo-Hoo!!!! YEAH!!!!

So that part of the text is right at this point.

Needless to say, this is very, very important data.

So chances are very strong, at this point, that it's the RK11.

What did you want to do next? You could start with the RK11 registers. Also, use PDP11GUI to read the copy off the swap device, once I decipher the stack?


PA:165256: 007656 001654 000104 000000 101602 066312 000000 141726

OK, so the 01654 is the start address in PA (in clicks) for the area to swap out, and that matches UISA0. 0104 is the text length (also in clicks), and that also matches. The 0 is a flag which says it's a write (read is 01). And the 07656 is the block number (4014.).
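The same decoding, written out as a small Python sketch. The word order follows the explanation above (block number, start in clicks, length in clicks, read flag); the function and key names are invented for the example.

```python
def decode_swap_args(stack_words):
    """Interpret the top four kernel-stack words as described above."""
    blkno, coreaddr, count, rdflg = stack_words[:4]
    return {
        "block": blkno,                # disk block number on the swap device
        "start_pa": coreaddr << 6,     # clicks -> bytes
        "length_bytes": count << 6,    # clicks -> bytes
        "operation": "read" if rdflg else "write",
    }

# Top of the kernel stack as dumped at PA:165256.
args = decode_swap_args([0o007656, 0o001654, 0o000104, 0o000000])
print(f"block {args['block']}. (0{args['block']:o})")
print(f"start PA:{args['start_pa']:o}, length {args['length_bytes']:o} bytes")
print(f"operation: {args['operation']}")
```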

Fritz:

I should have a valid swap on the disk from before I shut down... Going to fire up PDP11GUI and grab it now to have a look. We want blocks 4014-4022, then? (9 x 512-byte blocks = 0110 clicks if I got that right?)

Noel:

4014.-4023., I think...

(9 x 512-byte blocks = 0110 clicks if I got that right?)

I think 8-1/2 or so; text is 010400 bytes (a little less, actually, but that's what the system is using), 01000 bytes/block, = 010.4 blocks.
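For anyone who doesn't divide in octal every day, here is the same sum as a tiny Python sketch (the function name is just for illustration): 010.4 octal is 8.5 decimal, so the transfer covers eight full blocks plus half of a ninth.

```python
BLOCK_BYTES = 0o1000  # 512-byte disk block

def blocks_spanned(nbytes):
    """Return (full blocks, leftover bytes, blocks touched) for a transfer."""
    full, leftover = divmod(nbytes, BLOCK_BYTES)
    touched = full + (1 if leftover else 0)
    return full, leftover, touched

full, leftover, touched = blocks_spanned(0o10400)
print(f"{full} full blocks + {leftover:o} (octal) bytes = "
      f"{0o10400 / BLOCK_BYTES} blocks; {touched} blocks touched")
```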

Fritz:

Hmm, the beginning looks good, but it seems to cut off too soon:

0000000    000000  000000  000000  000000  000000  000000  000000  000000
*
7656000    170011  010600  011046  005720  010066  000002  004767  000010
7656020    010016  004737  006374  104401  004567  010154  162706  000044
7656040    012716  000001  004737  004652  010067  022314  010516  062716
7656060    177762  004737  006346  016500  177762  062700  177413  010067
   |
7660320    000137  002346  016516  000004  012746  020452  004737  003562
7660340    005726  000137  002542  005067  017552  012704  022336  005003
7660360    012716  021050  004737  005042  110024  005203  022703  000020
7660400    000000  000000  000000  000000  000000  000000  000000  000000
*
11410000

Noel:

7656000    170011  010600  011046  005720  010066  000002  004767  000010

Yup, good start; SETD, etc.

7660360    012716  021050  004737  005042  110024  005203  022703  000020
7660400    000000  000000  000000  000000  000000  000000  000000  000000

Hunh; not good. (Might be worth looking at that location in main memory, see if it's zeros or not.)

That's so odd that it's all zeros - I wonder where they came from? Maybe they were already on the disk, and the write stopped way early? (At 01000 bytes per block, it stopped after 2-1/2 blocks; 056000s, 057000s, stopped half-way through the 060000's.)

Would be useful to have the RK register contents after the swap() call returns...

Fritz:

Okay, the write should be from PA:165400 - PA:175777, to sectors 07656 - 07667. Block 7667 encodes to an RKDA value of 012363.
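The block-to-RKDA encoding can be sketched in Python as below. It assumes the usual RK11 disk address layout (drive in bits 15-13, cylinder and surface in bits 12-4, sector in bits 3-0) and 12 sectors per track; the function name is made up.

```python
SECTORS_PER_TRACK = 12  # an RK05 track holds 12 sectors

def rkda_for_block(blkno, drive=0):
    """Encode a linear block number as an RK11 disk address (RKDA) value."""
    track, sector = divmod(blkno, SECTORS_PER_TRACK)
    # 'track' already combines cylinder and surface (surface is its low bit).
    return (drive << 13) | (track << 4) | sector

for blk in (0o7656, 0o7667):
    print(f"block {blk:05o} -> RKDA {rkda_for_block(blk):06o}")
```

Block 07667 comes out as 012363, matching the value quoted above; the starting block 07656 encodes to 012346.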

After the halt, I find:

RKDS: 004707 (OK)
RKER: 000000 (OK)
RKCS: 000322 (BOGUS! EX.MEM = 01)
RKWC: 000000 (OK)
RKBA: 176000 (OK)
RKDA: 012363 (OK)

So, EX.MEM are the smoking bits here! I will review the associated designs and come up with things to try/check.
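To make the bogus bit easier to see, here is a minimal Python sketch that picks RKCS apart. The field positions (GO in bit 0, function in bits 3-1, EX.MEM in bits 5-4, interrupt enable in bit 6, ready in bit 7) are my reading of the RK11 register layout, and the helper names are invented; treat it as an aid to reading the dump rather than a reference.

```python
FUNCTIONS = ["control reset", "write", "read", "write check",
             "seek", "read check", "drive reset", "write lock"]

def decode_rkcs(rkcs):
    """Pull out the low-order RKCS fields of interest here."""
    return {
        "go": rkcs & 1,
        "function": FUNCTIONS[(rkcs >> 1) & 0o7],
        "ex_mem": (rkcs >> 4) & 0o3,
        "int_enable": (rkcs >> 6) & 1,
        "ready": (rkcs >> 7) & 1,
    }

def bus_address(rkcs, rkba):
    """Full 18-bit bus address formed from the EX.MEM bits and RKBA."""
    return (decode_rkcs(rkcs)["ex_mem"] << 16) | rkba

fields = decode_rkcs(0o000322)
print(fields)
print(f"controller thinks it finished at PA:{bus_address(0o000322, 0o176000):o}")
```

With RKCS = 000322 and RKBA = 176000 the controller's idea of the final address is 376000, not the 176000 it should be.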


Okay, taking a look:

RKBA is implemented in the M795 module in slots AB07, as detailed on sheet RK11-C-15. The M795 is a generic WC/BA Unibus interfacing module. The BA part only covers 16 bits, but generates an overflow out "D15 RKBA=ALL 1 L".

EX MEM 01 and EX MEM 02 are maintained on the M239 module in slot A17, as detailed on sheet RK11-C-03. The M239 is a 3x 4-bit counter/register module, so this also implements counting up these bits, when triggered by "D15 RKBA = ALL 1 L".

Based on where we see the data on disk fall off (offset 2400) and the start PA (165400), I'm guessing we get a false trigger on this "ALL 1" at RKBA 167777. So that looks like a false "1" detect on RKBA bit 12.
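The reasoning in that last paragraph can be checked with a few lines of Python. The disk offsets are taken from the dump above; the helper name is made up, and the "last good address" is simply the byte before the point where the data on disk turns to zeros.

```python
def zero_bits(value, width=16):
    """Bit positions (0 = LSB) that are clear in a width-bit value."""
    return [bit for bit in range(width) if not (value >> bit) & 1]

start_pa = 0o165400                    # start of the text in memory
good_bytes = 0o7660400 - 0o7656000     # bytes on disk before the zeros begin
first_bad_pa = start_pa + good_bytes   # first address that never made it out
last_good_pa = first_bad_pa - 1

print(f"data good for {good_bytes:o} bytes; first bad PA:{first_bad_pa:o}")
print(f"RKBA at the false carry: {last_good_pa:o}, "
      f"clear bits: {zero_bits(last_good_pa)}")
```

The only clear bit in 167777 is bit 12, which is the bit the "ALL 1" detector must have been misreading as a one.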

So I think the thing to do is to put the M795 out on an extender, load RKBA with 167777, and have a check at E28 pin 5, and E34 pin 8!

And we leave the cliffhanger there, for now, at least until tomorrow evening. Because of the way the RK11-C is mounted, in order to do the above I'm going to have to spin the whole machine around (it's a dual H960), extend the RK05s so there is room to physically climb in the back, rig a work light, and get on in there...

February 11

SUCCESS!!

Put the M795 out on an extender, loaded 167777 in RKBAR, and had a look around with a logic probe. Narrowed it down to E34 (a 7430 8-input NAND). Pulled, socketed, replaced, and off she goes!

I can now successfully boot and run both V6 Unix and RSTS/E V06C from disk.

THAT was a really fun and rewarding hunt :-) First message in the thread was back on Dec 30, 2018. Lots of debugging and DRAM repairs, then the final long assault to this single, failed gate...

Thanks to all here for the help and resources, and particular shout-outs for Noel and Paul who gave generously of their time and attention working through the densest bits, both on and off the list.

I predict a long happy weekend and a big power bill at the end of the month :-)

M795 module and the single failed gate