GT40 Terminal II: Lunar Lander

Fri 01 September 2023 by Fritz Mueller

[Continuation of restoration work on a DEC GT40 graphic display terminal; part one here.]

At this point, Scott had taken over the restoration work as I had had to leave town for work. We consulted a few times via IMs and video calls over the next couple weeks; the following is a narrative of Scott's continued work on the project as I understood it remotely.

The next thing that needed doing was to replace the failed microcode PROM described at the end of the previous article. I did some work to manually transcribe the PROM contents from the binary microcode listings included in the engineering drawings (4 bits x 256 microcode locations). Scott double-checked my work here and found and fixed at least three transcription errors (always good to have a double check on tedious tasks like this, and I seem to be developing a worsening dyslexia with age -- thanks, Scott!)

Scott tracked down and burned a replacement PROM, swapped it for the failed one on the board, and we were back at the previous high-water mark (able to run toggle-in programs and the ROM bootstrap terminal emulator, with the same linefeed and binary load failures I had seen previously).

Scott played around with the binary loader for a bit, but it seemed to be suffering a pretty frustrating blend of several different issues. Attention was turned back to the bootstrap ROM terminal emulator LF handling problem, which was consistent and repeatable. Scott began single stepping the code by instruction, using the listings in the GT40/GT42 User's Guide, and soon made two discoveries:

The unit has the GT42 version of the bootstrap ROM, and not the GT40 version (this can be seen because the bootstrap terminal emulator correctly handles TAB characters).
Upon receiving a LF char, the bootstrap code got to a loop which was scanning the input buffer looking for LFs, but failed to find any and looped indefinitely.

The malfunctioning ROM code scanning for LFs can be seen at location 166310, in the listing on PDF page 81 of the GT40/GT42 User's Guide, and is as follows:

166310  122300              LFLOOP: CMPB    (SCAN)+,CHAR        ;AND LOOK FOR A LINEFEED
166312  001406                      BEQ     LFOUND              ;GOT IT, SEARCH HAS ENDED
166314  020327  007000              CMP     SCAN,#BLIMIT        ;ARE WE AT END OF BUFFER?
166320  103773                      BLO     LFLOOP              ;NOPE, KEEP ON LOOKING.
166322  012703  001000              MOV     #BSTART,SCAN        ;IF AT TOP, RESET TO BOTTOM OF BUFFER
166326  000770                      BR      LFLOOP              ;AND KEEP ON LOOKING.
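
As a cross-check on the listing above, here is a minimal Python sketch that decodes just the instruction shapes used in this loop (double-operand instructions and three branches). The field layout is the standard PDP-11 one; the function names and the `#imm` placeholder are mine, and the decoder is nowhere near complete. Note that the source field of 122300 names register 3, so SCAN is R3 and CHAR is R0.

```python
# Sketch only: decodes the handful of PDP-11 instruction shapes in the LFLOOP listing.
DOUBLE_OPS = {1: "MOV", 2: "CMP", 3: "BIT", 4: "BIC", 5: "BIS"}
BRANCHES = {0o000400: "BR", 0o001400: "BEQ", 0o103400: "BLO"}
MODE_FMT = ["R{r}", "(R{r})", "(R{r})+", "@(R{r})+",
            "-(R{r})", "@-(R{r})", "X(R{r})", "@X(R{r})"]


def operand(field):
    """Format a 6-bit operand field: 3 bits of mode, 3 bits of register."""
    mode, reg = (field >> 3) & 7, field & 7
    if mode == 2 and reg == 7:
        return "#imm"  # (PC)+ is immediate addressing; the value is the next word
    return MODE_FMT[mode].format(r=reg)


def decode(addr, word):
    """Return a mnemonic string for one instruction word located at addr."""
    base = word & 0o177400
    if base in BRANCHES:
        offset = word & 0o377
        if offset & 0o200:          # sign-extend the 8-bit word offset
            offset -= 0o400
        target = (addr + 2 + 2 * offset) & 0o177777
        return f"{BRANCHES[base]} {target:06o}"
    op = (word >> 12) & 7
    if op in DOUBLE_OPS:
        suffix = "B" if word & 0o100000 else ""   # bit 15 selects byte mode
        return f"{DOUBLE_OPS[op]}{suffix} {operand(word >> 6)},{operand(word)}"
    raise ValueError(f"not handled by this sketch: {word:06o}")


print(decode(0o166310, 0o122300))  # CMPB (R3)+,R0
print(decode(0o166320, 0o103773))  # BLO 166310
```

Both branches at 166320 and 166326 resolve back to 166310 (LFLOOP), and the BEQ at 166312 resolves to 166330, the word just past the loop.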

Scott began microstepping at program address 166310, which is machine code 122300, CMPB (R3)+,R0 (per the operand fields in the listing, SCAN is R3 and CHAR is R0). The microcode flow traced through is as follows, using state names from the microcode listings in the engineering drawings:

H-2: Tracing activity starts with the machine halted and looping in microstate H-2. The KM11 is set to manual clock mode, front panel CONT switch depressed and held, and several manual clocks taken causing microbranch to...
CCS-1: Loads B←PC, causing PC to be displayed on console lights.
CCS-2: Loops waiting for CONT switch to be released.
CCS-3: Turns on RUN light.
F-1: Loads BA←PC, and initiates asynchronous bus cycle to fetch instruction.
F-2: Loads B←PC+2, causing next instruction address to be displayed on console lights.
F-3: Loads PC←B, updating the PC, and suspends processor clock until instruction fetch bus cycle reaches SSYN.
F-4: Resumes here when fetched instruction is on bus; latches into B (displaying instruction on console lights) and also into the IR decode register; releases the bus.
F-5: First big microcode branch based on instruction type.
S2-1: Source addressing mode 2 (register auto-increment). BA←R[S], and initiates asynchronous bus cycle to read source operand from memory.
S2-2: B←R[S]+1+BYTE.BAR, which increments the source register by 1 or 2 depending on byte or word instruction.
S2-3: R[S]←B (stores back incremented source register), suspends processor clock until source operand fetch bus cycle reaches SSYN.
S1-2: Resumes here when source operand is on bus; latches into B (displaying source operand on console lights) and releases the bus, then branches on byte even, byte odd, or word.

So far so good. In the case being traced, we happen to be doing a byte read from an odd address. In this case, the fetched source data must next be shifted right 8 bits; this is done over the course of the next 8 microinstructions, SBO-1 through SBO-8. Here Scott noticed a problem: bit 3 was always set in the B register after any single right shift, even if bit 4 to its left was zero. This points directly at E044 on sheet DPA, a four-bit shift register which implements bits 0:3 of the B-register:

KD11-B data path bad BREG shift register
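
To see why this fault keeps the loop from ever finding a linefeed in an odd byte, here is a small Python model of the eight single-bit shifts. It is a sketch under assumptions: I model each SBO step as a plain logical right shift with zero fill, and the fault as bit 3 being forced to one after every shift, which is how the symptom is described above. The function name is made up.

```python
LF = 0o12  # ASCII linefeed


def shift_right_byte(word, stuck_bit3=False):
    """Move the high (odd-address) byte of a 16-bit word down into bits 7:0,
    one bit per step, optionally forcing bit 3 high after each step."""
    b = word & 0xFFFF
    for _ in range(8):
        b >>= 1
        if stuck_bit3:
            b |= 0x08  # assumed fault model: bit 3 reads back as 1 after each shift
    return b & 0xFF


word = (LF << 8) | ord("A")   # LF in the odd byte, 'A' in the even byte
print(oct(shift_right_byte(word)))                    # 0o12
print(oct(shift_right_byte(word, stuck_bit3=True)))   # 0o17
```

In this model the forced one at bit 3 ripples down through bits 2, 1 and 0 over the last few shifts, so the low nibble always ends up as 1111. No odd-address byte can then compare equal to 012, whatever was actually in the buffer.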

This part was pulled and replaced, and the ROM terminal emulator could then correctly handle LFs! After a few additional red herrings to do with loose power connectors and occasional accidental bumping of the test switches on the M7013 display control board, Scott was also able to get the lunar lander code to load and run via the ROM bootstrap binary loader, though still with some display problems:

GT40 displaying multiple lines after CMPB (R3)+,R0 LF fix. GT40 running the lunar lander game, but still with some display issues.

Scott discovered a major clue concerning remaining loader problems: some GT40 binary-loader encoded binaries we had been using which were downloaded off other enthusiast web sites contained erroneous extra linefeed and "!" characters, which confused the loader and/or triggered bad checksums. After stripping these out, the loader was seen to work quite reliably.

With diagnostics now in hand, Scott was able to track down a few remaining hardware issues on the display boards: a bad register with a stuck-high bit, and a flaky DAC, which was swapped for one from a spare board. (I don't have precise details on these particular fixes, but will expand here later if/when I get more information.)

Below, screen shots of some diagnostics, and at long last, Scott lands on the moon and gets his cheeseburger! Drop by and visit Scott at his booth at VCFMW this weekend, see and play the game, and hear tales of the restoration first-hand!

Repaired GT40 running diagnostic MAINDEC-DDGTE. Repaired GT40 running the lunar lander game. Repaired GT40: Scott lands on the moon and gets his cheeseburger!

GT40 Terminal

Mon 21 August 2023 by Fritz Mueller

A while ago my friend Scott approached me with an idea to collaborate on restoration of a DEC GT40 graphic display terminal of unknown status, belonging to a third collector friend of ours; the idea was to restore the machine to working order for exhibition at the various summer/fall vintage computer shows. The GT40 ran an early (pre-Atari) graphical version of the lunar lander game which was released in 1973. The 50th anniversary of this code seemed a nice theme for the exhibit.

The GT40 was an integrated product consisting of a PDP-11/05 (KD11-B) CPU, a VT11 vector display processor, a VR14 X-Y monitor, a light pen, keyboard, and a DL11 serial interface:

The DEC GT40

Scott retrieved the terminal, which had a fairly bad case of screen rot. We agreed that Scott would work on restoration of the monitor while I dug in on the system unit. Scott got to work while I dealt with distractions of ongoing home renovations, a ton of work-related travel, and my first bout of COVID (blaargh!)

The screen rot is caused by deterioration of a polyvinyl acetate (PVA) layer sandwiched between the front surface of the CRT glass and an outer protective implosion shield. All of this is held together by a retaining ring affixed to the CRT with silicone adhesive. The only fix for this is to disassemble the monitor, separate the sandwich, and clean out and replace the deteriorated PVA layer.

After chatting with some folks who had successfully conducted a similar VR14 restoration at the Rhode Island Computer Museum, Scott obtained some silicone digester to aid in separation of the retaining ring. The terminal was disassembled and then digester was repeatedly injected under the ring with a syringe, allowed to sit, and the resulting softened silicone scraped away over the course of a week.

Scott then worked to conform a lexan sheet to the interior of the implosion shield as a replacement for the PVA layer, as RICM had done. This process, conducted in a home oven, proved to be quite fiddly. But persistence paid off, and the end result looks very nice!

After a precautionary reform of the larger power supply electrolytics, careful reassembly, and a gradual bringup on a variac, the monitor showed proof of life on the bench, hooked up to a signal generator source.

GT40 display (VR14) with screen rot. Silicone digester repeatedly injected beneath the retaining ring, and softened silicone scraped away a layer at a time. After a week of work the retaining ring was freed and the layers were able to be separated and cleaned. Conforming plexiglass in the oven to fill the gap between the display tube and the implosion shield where the PVA used to be. Display re-assembled and working, driven by a test oscillator, and looking nice!

In the meantime, starting to feel better, I began to look at the CPU unit. Power supply electrolytics appeared to be in good shape, and the supply came up on the bench without much difficulty.

The module utilization for this backplane is as follows:

	A	B	C	D	E	F
1	A320 VT40 Display Generator
2	M7013 VT40 Display Control
3	M7014 VT40 Bus Control
4			M7800 DL11 Serial
5	M930 Term. / UNIBUS out		H214 Core Stack (8K x 16)
6	G231 Core Memory Driver
7	G110 Core Memory Control
8	M7261 KD11-B Control
9	M7260 KD11-B Data Paths

On the assumption (later proved wrong) that this was effectively the same as a PDP-11/05 setup, I began debug with just the two CPU cards, an M9301 boot/terminator in position 5A-B, and a grant continuity "knuckle buster" in position 4D. Some problems were immediately apparent from the front console: deposit and examine operations to various memory-mapped CPU registers seemed to work as expected, but when examining the contents of M9301 ROM locations, bit 13 always displayed as zero. The CPU would not enter run mode, nor could it be single stepped.

Docs suggested that the GT40 would accommodate a KM11 debug module in position 1B, so in this went. The machine behaved even more strangely when the KM11 was in, hanging up entirely unless the KM11 was put in manual clock mode, and even then stepping microstates at unexpected times. It took a little probing of the CPU clock circuits to discover why:

Here we see the RC clock at E019. CONJ MAN CLK L is wired to KM11 switch S2, and inhibits the RC clock when pulled low. With the RC clock thus disabled, NOR gate E027 admits manual clocking via CONJ S CLK ON L, connected to KM11 (momentary) switch S4. The output at E027 pin 11 continues downstream from here as the basis of the main processor clock signal.

As it happened, momentary switch S4 was wired on my KM11 replica with opposite sense from that expected. Thus in its resting position CONJ S CLK ON L was asserted (low), which meant the clock output at E027 pin 11 was forced constantly high, regardless of the state of the RC clock. This was verified by leaving S2 "off" and pulling S4 over to its momentary position, whence the CPU clock immediately picked up again.

I had never noticed this switch reversal when using the KM11 with the 11/45, the RK11-C, or the 11/34 -- all of these have different clocking circuits unaffected by the default position of S4. Desoldered and rotated S4 180 degrees, and the problem was fixed.

After having addressed this, I single stepped through a few of the console microcode flows and was able to match the microcode listings to what was displayed on the KM11 and the console lights with some success. An action shot of the KM11:

A few tips for anybody else who might be micro-stepping the KD11-B CPU, while we are here:

The MPC address displayed on the KM11 is negated -- dark LEDs are ones, and lit LEDs are zeros. This definitely takes a little getting used to...
The MPC address displayed on the KM11 is the address of the next micro-instruction, not the current one. This is also pretty tricky until you get the hang of it. One nice thing about it, though, is that the displayed next address does include the wired-or outputs of micro-branches.
Each manual clock toggle is one bus clock, and typically, a micro-instruction will take two bus clocks to execute. An exception is the inner part (single shifts) of the micro-flows for shift and rotate instructions, which only take a single bus clock. Generally, it is useful to go ahead and advance two bus clocks at a time, as it is easy to get confused probing for signals that by design aren't clocked until the second bus clock within the micro-instruction.
The console lights are hard-wired to always display the ALU B-leg input. Useful intermediate information is often displayed there intentionally by the microcode flows.
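
The first two tips can be captured in a tiny helper. This is a sketch under assumptions: I represent the KM11 MPC lamps as an 8-bit integer with a 1 for each lit LED (8 bits matches the 256 microcode locations mentioned in these posts), and the function name is invented.

```python
MPC_MASK = 0o377  # assumed 8 address lamps, one per micro-address bit


def mpc_from_leds(lit_pattern):
    """Convert the lit-LED pattern on the KM11 into the micro-address it shows.
    The display is negated: a dark LED is a one, a lit LED is a zero."""
    return ~lit_pattern & MPC_MASK


# While the machine sits in the micro-instruction at 317, the lamps show the
# NEXT address (307), so LEDs 3, 4 and 5 are lit and the rest are dark.
print(f"{mpc_from_leds(0o070):03o}")  # 307
```

Reading the lamps this way and then looking the result up in the NXT column of the microcode listing, rather than the LOC column, avoids most of the off-by-one confusion.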

Now it was possible to put the data paths board out on extenders and step the microcode for a console examine of a ROM location with bit 13 set, and see why bit 13 never showed up on the console lights. To understand this properly, we need to see an excerpt of the KD11-B data paths:

KD11-B data path (excerpt)

Here you see the ALU in the middle, fed by its B-leg and A-leg inputs. B-leg is fed from the B-register, with provisions for shifting, sign-extension, or forcing the constant +1. B-leg is also continuously displayed on the console lights. A-leg contains, significantly, the 16-location scratch-pad memory (SPM). The first eight locations of this hold processor registers R0 through R7; the remaining eight locations serve as temporary registers for use by the microcode. A-leg can also provide miscellaneous constants from a small ROM.

The A-mux, below the ALU, determines whether the main processor data path is fed from the ALU output, or from the UNIBUS data lines.

With this in mind, the relevant microcode source sequence (taken from the listings in the engineering drawings) is as follows:

LOC  NXT  * CONSOLE EXAMINE SWITCH- FIRST TIME IN SEQUENCE (DON'T INC R[17])
          / GET TO CE1-1 FROM H-2 VIA BUT SWITCH
          / GET TO CE1-1 FROM CE2-2 VIA GOTO
317  307  CE1-1 BA,B←R[17]; BUT SWITCH
          / DISPLAY ADDRESS BY PUTTING INTO THE B REGISTER WHILE EXAMINE IS DOWN
          / LOOP ON CE1-1 UNTIL SWITCH IS RELEASED
307  326  CE1-2 DATI; CKOFF
326  302  CE1-3 B←UNIBUS DATA; GOTO H-2

At micro-location 317 (state CE1-1, "console examine 1"), the bus address register and B-register are loaded from SPM location 17, which holds the current console load/examine address. BUT SWITCH ("branch micro-test switch") causes the microcode to loop here as long as the examine switch is depressed. During this time, the fetch address is displayed on the console lights since it has been loaded into the B-register. This was all observed to be functioning normally.

When the examine switch is released, we branch to micro-location 307. Here, a UNIBUS read (DATI) bus cycle is initiated, and the processor clock and microcode execution are suspended until the bus target asynchronously asserts SSYN (indicating valid data on the bus) or alternatively times out. The bus cycle was observed to occur normally, leaving SSYN and the correct data (including a correct bit 13) asserted on the UNIBUS.

Proceeding to micro-location 326, we see that the A-mux should be set up there to admit the data from the UNIBUS to the main processor data path and then the B-register should be latched for display. Here a problem was apparent. Sheet DPD of the GT40 engineering drawings covers bits 15:12 of the data paths; package E015 there is an 8266 2x4 mux which implements that slice of the A-mux. E015 was seen via logic probe to be set up with correct select codes and correct input from the UNIBUS. UNIBUS bit 13 was not being correctly passed on to its corresponding output, however -- a failed part.

KD11-B AMUX 15:12

The 8266 is out of production and somewhat rare; for the time being a functioning 8266 was "borrowed" from a spare GT40 data paths board that we obtained from our fellow collector. Removed the bad part, socketed, and replaced with the borrowed part, and the bit 13 display problem was fixed!

Moving next to the run/step problem, the machine was seen to be hanging up in micro-state F-3, after initiating the DATI bus cycle to fetch an instruction. This led to investigation of some of the bus control logic, as detailed on sheet CONC of the engineering drawings:

GT40 DATI bus control logic (excerpt)

The CPU must negotiate for control of the UNIBUS and assert BBSY if successful. Here I could see the DATI request being successfully latched, but BBSY assertion was blocked at E014 by CONC NPR GRANT H, a non-processor request (DMA) bus grant. Sure enough, some more probing indicated that the processor had issued an NPR grant because it was reading an NPR request over the UNIBUS. Where was that coming from with nothing else in the system?

Well, it turns out in the GT40 the near-side bus termination is integrated onto the M7014 GT40 bus control board that must be in slot 3, so you can't really debug without this card in place! (It could be that an additional M930 terminator in 3-A,B would work, as in a stock 11/05, but I haven't checked the backplane wire list in detail to be certain of this.) In any case, slotted in the M7014, and the machine began to behave much more rationally with a properly configured bus...

Went for broke and slotted in the rest of the display interface boards and (why not?) the core memory and DL11 as well. The machine was showing very promising signs of life. The terminal emulator in the bootstrap ROM ran and was able to render received characters on the display! Characters typed on the keyboard were also successfully forwarded out the DL11. A line feed character typed to the terminal emulator seemed to crash it, so that still needed to be looked into. Took the time to toggle in a small test program from the user guide, and this executed correctly, rendering a square on the display, indicating most of the logic in the display interface boards was also functioning correctly:

First sign of end-to-end life on the GT40: terminal emulator bootstrap running, and rendering received characters. GT40 display list processor running, rendering a square.

The toggle-in program running above:

000100  012706  000500          START:  MOV     #500,R6         ; SETUP STACK
000104  012737  002000  172000          MOV     #TABLE,@#DPC    ; START VT11 ON TABLE
000112  000001                  DONE:   WAIT                    ; LET NPR HAPPEN
000114  000776                          BR      DONE            ; KEEP WAITING IF INTERRUPTED

002000  117124                  TABLE:  .WORD   POINT+INT4+LPOFF+BLKOFF+LINE0
002002  000500  000500                  .WORD   500, 500
002006  110000                          .WORD   LONGV
002010  040200  000000                  .WORD   200+INTX, 0
002014  040000  000200                  .WORD   0+INTX, 200
002020  060200  000000                  .WORD   200+INTX+MINUS, 0
002024  040000  020200                  .WORD   0+INTX, 200+MINUS
002030  160000  002000                  .WORD   DJMP, TABLE
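
For anyone curious why this display list draws a square, here is a short Python sketch that walks the long-vector words from the table above. The INTX and MINUS values can be read straight off the listing (040200 is 200+INTX, 060200 is 200+INTX+MINUS); the 10-bit magnitude mask and the function names are my assumptions, and this is in no way a VT11 emulator.

```python
INTX = 0o040000   # intensify bit, read off the listing (040200 = 200+INTX)
MINUS = 0o020000  # sign bit, read off the listing (060200 = 200+INTX+MINUS)
MAG = 0o1777      # assumed 10-bit magnitude field


def delta(word):
    """Signed displacement encoded in one long-vector data word."""
    mag = word & MAG
    return -mag if word & MINUS else mag


def trace(start, vectors):
    """Return the beam positions visited, beginning at the POINT coordinates."""
    x, y = start
    points = [(x, y)]
    for wx, wy in vectors:
        x, y = x + delta(wx), y + delta(wy)
        points.append((x, y))
    return points


start = (0o500, 0o500)                       # words at 002002
vectors = [(0o040200, 0o000000),             # words at 002010
           (0o040000, 0o000200),             # words at 002014
           (0o060200, 0o000000),             # words at 002020
           (0o040000, 0o020200)]             # words at 002024
for x, y in trace(start, vectors):
    print(f"{x:o},{y:o}")                    # 500,500 700,500 700,700 500,700 500,500
```

The four vectors return the beam to its starting point, and the DJMP back to TABLE simply redraws the same square to keep it refreshed.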

Tried to get some program uploads going over the built-in binary loader in the bootstrap terminal emulator, but this didn't seem to be quite working, either. Took a short break for dinner, returned to examine this further, ran for a few minutes, then disaster... Something in the CPU let go, and the machine was once again unable to execute code.

Digging in on this new failure a little, when attempting to single step ROM code from the front panel, the PC was seen to increment by +1 instead of the expected +2; this resulted in an immediate bus error that halted the machine. Back in goes the KM11, then, and the same microcode stepping technique was used to begin investigating.

So how does the KD11-B (ostensibly) increment the PC by 2? It turns out this is done by selecting the PC (SPM location 7) onto the ALU A-leg, constant +1 on the ALU B-leg, and introducing the additional +1 at the carry input of the least significant bit slice of the ALU on sheet DPA of the engineering diagrams:

KD11-B ALU least-significant slice

Signal CONF CIN H comes from microcode, wire-or'd with output of operation decode ROMs in the ALU aux control circuitry. In this case, the logic probe revealed that this signal was erroneously low; further investigation revealed that microcode PROM CONF E094 had failed:

KD11-B failed microcode PROM CONF E094
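
The arithmetic behind the symptom is simple enough to sketch in a few lines of Python. This models only the add described above (A-leg plus B-leg plus carry-in, truncated to 16 bits); the function name and the sample PC value are hypothetical.

```python
def alu_add(a_leg, b_leg, carry_in):
    """16-bit add as described above: A-leg + B-leg + carry into the low bit slice."""
    return (a_leg + b_leg + carry_in) & 0xFFFF


pc = 0o165000                      # hypothetical even PC value, for illustration
good = alu_add(pc, 1, carry_in=1)  # CONF CIN H asserted: PC + 2
bad = alu_add(pc, 1, carry_in=0)   # CONF CIN H erroneously low: PC + 1
print(f"{good:06o} {bad:06o}")     # 165002 165001
```

With the carry-in missing the new PC is odd, which fits the immediate bus error seen on the next instruction fetch.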

Alright, this is an IM5603 (equiv. 82S126N) bipolar PROM, and I don't happen to have that in stock. So now we're stuck until we can source one. At this point, the day job once again intervened, and I needed to prepare to head off to the Rubin Observatory in Chile for a couple of weeks. Scott came by to pick up the work in progress; had time to share a short demonstration of microcode debug techniques, then off to pack and prepare for my trip...

[ to be continued... ]

BASIC-11 under RT11

Sun 15 August 2021 by Fritz Mueller

I figured it might be fun to play around a little bit with BASIC-11 under RT11 on the newly-restored PDP-11/34. If I got that working, it could also be included on the RK05 RT11 disk image that I use regularly for demos on the larger PDP-11/45.

The first thing to do was to find a compatible disk image and get it running under simh. Bitsavers had BASIC-11_V2.1_RX02.DSK.zip, which would seem to fit the bill, but the contained image would not mount successfully on simh's RY device. Looking through a dump of the image, there was an apparent "RT11A" signature, so that looked promising. Tried putr under dosbox as well, but it would hang mounting the image. So, off to the cctalk mailing list for some advice...

Glen Slick on the list first noticed a file size discrepancy:

That BASIC.DSK image file has a size of 486,400 bytes. I don't know where that size would come from.

A physical RX-02 floppy should have a sector size of 256 bytes, with 26 sectors per track, and 77 tracks, which would be a total of 512,512 bytes, or 505,856 bytes if the first physical track is ignored.

Indeed, the other RX-02 floppy images available here do have a size of 505,856 bytes: http://www.bitsavers.org/bits/DEC/pdp11/floppyimages/rx02/

Hmm, maybe that BASIC.DSK image file was created by something that only copied the initial allocated logical sectors and ignored unused logical sectors at the end of the floppy, and maybe PUTR doesn't handle disk image files that are not the expected full size?
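
The sizes quoted above are easy to sanity-check. A quick Python sketch of the arithmetic (the constant names are mine):

```python
SECTOR_BYTES = 256
SECTORS_PER_TRACK = 26
TRACKS = 77

full_size = SECTOR_BYTES * SECTORS_PER_TRACK * TRACKS        # all 77 tracks
without_track0 = full_size - SECTOR_BYTES * SECTORS_PER_TRACK
image_size = 486_400                                         # the BASIC.DSK file
pad_bytes = full_size - image_size                           # padding to full size

print(full_size, without_track0, pad_bytes)   # 512512 505856 26112
print(image_size // 512)                      # 950 blocks of 512 bytes
```

So the image is exactly 950 512-byte blocks, and 26,112 bytes short of a full 77-track floppy.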

Example of padding the 486,400 byte BASIC.DSK image file to a size of 512,512 bytes on a Windows system:
FSUTIL FILE CREATENEW BLANK 26112
COPY /B BASIC.DSK+BLANK TEST.DSK

C:\PUTR>DIR TEST.DSK
Volume in drive C has no label.
Volume Serial Number is 14CE-1A29
Directory of C:\PUTR
08/11/2021  12:55p             512,512 TEST.DSK

C:\PUTR>PUTR
PUTR V2.01  Copyright (C) 1995-2001 by John Wilson <wilson@dbit.com>.
All rights reserved.  See www.dbit.com for other DEC-related software.

COPY mode is ASCII, SET COPY BINARY to change
(C:\PUTR)>MOUNT RX: TEST.DSK /RX02 /RT11 /RONLY
(C:\PUTR)>DIR RX:

Volume in drive RX is RT11A
Directory of RX:\*.*

11-Aug-2021
BSOT0D.EAE    12  04-Apr-1983
BSOT0S.EAE    10  04-Apr-1983
BSOT1D.EAE     9  04-Apr-1983
BSOT1S.EAE     6  04-Apr-1983
BSOT0D.EIS    12  04-Apr-1983
...

...etc. Nice. Still no luck mounting under simh, though. Glen further offers:

As far as I can tell by default PUTR expects to work with logical sector order RX-02 disk images that are 512,512 bytes in size. The BASIC-11 RX-02 disk image available here is in logical sector order, but is less than 512,512 bytes in size: http://www.bitsavers.org/bits/DEC/pdp11/floppyimages/rx02/ PUTR appears to be unhappy with the disk image unless it is padded to 512,512 bytes in size.

On the other hand as far as I can tell by default SIMH expects to work with physical sector order RX-02 disk images. If I mount the logical sector order RX-02 disk image that works with PUTR in SIMH, then RT-11 gives a "?DIR-F-Invalid directory" error. If I translate the logical sector order RX-02 disk image back into a physical sector order disk image (dealing with track shifting, sector interleaving, and track to track sector skewing) then RT-11 on SIMH is happy with the disk image.

and:

One bit of information that I found helpful as a reference when I looked at this quite a while ago was the 2.11BSD RX02 floppy disk device driver source code, which can be viewed online here:

https://minnie.tuhs.org/cgi-bin/utree.pl?file=2.11BSD/sys/pdpuba/rx.c

In particular, the routine rxfactr(), which as the comment says it calculates the physical sector and physical track on the disk for a given logical sector.

I used that as a starting point to write a simple utility to read an RX-02 disk image file in logical sector order and output an RX-02 disk image file in physical sector order.

/*
*  rxfactr -- calculates the physical sector and physical
*  track on the disk for a given logical sector.
*  call:
*      rxfactr(logical_sector,&p_sector,&p_track);
*  the logical sector number (0 - 2001) is converted
*  to a physical sector number (1 - 26) and a physical
*  track number (0 - 76).
*  the logical sectors specify physical sectors that
*  are interleaved with a factor of 2. thus the sectors
*  are read in the following order for increasing
*  logical sector numbers (1,3, ... 23,25,2,4, ... 24,26)
*  There is also a 6 sector slew between tracks.
*  Logical sectors start at track 1, sector 1; go to
*  track 76 and then to track 0.  Thus, for example, unix block number
*  498 starts at track 0, sector 25 and runs thru track 0, sector 2
*  (or 6 depending on density).
*/
static
rxfactr(sectr, psectr, ptrck)
   register int sectr;
   int *psectr, *ptrck;
{
   register int p1, p2;

   p1 = sectr / 26;
   p2 = sectr % 26;
   /* 2 to 1 interleave */
   p2 = (2 * p2 + (p2 >= 13 ?  1 : 0)) % 26;
   /* 6 sector per track slew */
   *psectr = 1 + (p2 + 6 * p1) % 26;
   if (++p1 >= 77)
       p1 = 0;
   *ptrck = p1;
}
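
To make the reordering concrete, here is a Python sketch along the same lines: a direct port of rxfactr() plus a loop that copies each logical sector to its physical position. This is not Glen's utility, which I have not seen in source form; the function names are mine, and the output layout (tracks 0 through 76 in order, 26 sectors of 256 bytes each, short input zero-padded) is an assumption about what simh wants.

```python
SECTOR_BYTES = 256
SECTORS_PER_TRACK = 26
TRACKS = 77


def rxfactr(sectr):
    """Port of the 2.11BSD routine: logical sector -> (physical sector 1..26, track 0..76)."""
    p1, p2 = divmod(sectr, SECTORS_PER_TRACK)
    p2 = (2 * p2 + (1 if p2 >= 13 else 0)) % SECTORS_PER_TRACK   # 2:1 interleave
    psectr = 1 + (p2 + 6 * p1) % SECTORS_PER_TRACK               # 6-sector skew per track
    p1 += 1                                                      # logical data starts on track 1
    if p1 >= TRACKS:
        p1 = 0                                                   # last logical track wraps to track 0
    return psectr, p1


def logical_to_physical(logical):
    """Reorder a logical-sector-order image into physical order (assumed simh layout)."""
    total = TRACKS * SECTORS_PER_TRACK
    padded = logical.ljust(total * SECTOR_BYTES, b"\0")          # zero-pad short images
    out = bytearray(total * SECTOR_BYTES)
    for lsec in range(total):
        psectr, ptrck = rxfactr(lsec)
        dst = (ptrck * SECTORS_PER_TRACK + psectr - 1) * SECTOR_BYTES
        src = lsec * SECTOR_BYTES
        out[dst:dst + SECTOR_BYTES] = padded[src:src + SECTOR_BYTES]
    return bytes(out)


print(rxfactr(0), rxfactr(1), rxfactr(26))   # (1, 1) (3, 1) (7, 2)
```

Usage would be something like reading BASIC.DSK in binary mode, passing the bytes through logical_to_physical(), and writing the 512,512-byte result to a new file to attach to simh's RY device.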

An RX02 image shuffled into physical sector order generated by Glen and suitable for use with simh is attached here.

Jerry Weiss further suggested that the original, logically ordered image may work as is under simh if attached as an MSCP device rather than RX02. This turns out also to be the case:

On Fri, Aug 13, 2021 at 9:46 AM Jerry Weiss wrote:
Could you attach logical sector (block?) image as MSCP disk in SIMH? Other than some minor image manipulation for removing track 0 if present, is there any reason this would not be readable?

Hmm, it didn't occur to me to try that. Mounting the logical sector order RX-02 disk image, without any modification necessary, as a raw MSCP disk does indeed appear to work!
sim> ATTACH RQ1 BASIC.DSK
RQ1: 'BASIC.DSK' Contains RT11 partitions
1 valid partition, Type: V05, Sectors On Disk: 950

sim> SHOW RQ1
RQ1     486KB, attached to BASIC.DSK, write locked
       RD54, UNIT=1, autosize
       RAW format

.DIR DU1:

BSOT0D.EAE    12  04-Apr-83      BSOT0S.EAE    10  04-Apr-83
BSOT1D.EAE     9  04-Apr-83      BSOT1S.EAE     6  04-Apr-83
BSOT0D.EIS    12  04-Apr-83      BSOT0S.EIS     9  04-Apr-83
BSOT1D.EIS     9  04-Apr-83      BSOT1S.EIS     6  04-Apr-83
BSOT0S.FIS     7  04-Apr-83      BSOT1S.FIS     6  04-Apr-83
...

...etc. Armed with the above, I was able to get BASIC-11 into an RT11 image in the Unibone card, and running on the new PDP-11/34. Here's output from the DEC BASIC Mandelbrot program at Rosetta Code:

BASIC-11 Mandelbrot program on a PDP-11/34, program output

Unibone

Wed 24 March 2021 by Fritz Mueller

I have been keeping an eye on Jörg Hoppe's interesting Unibone project for some time -- it is a general-purpose Unibus device emulator and diagnostic tool, built around a BeagleBone Black compute module running embedded real-time Linux. The PDP-11/34 restoration project finally provided enough impetus for me to pull the trigger on getting one.

Sent Jörg an email to order a kit, which arrived some weeks later complete with bundled BeagleBone. The kit is pretty well thought-out and was enjoyable to put together. Had to throw in a few of my own pin headers and jumpers to complete the assembly. The only other small confusions were a few of the resistor packs which did not match the schematic (Jörg informed me these are non-critical values.)

The kit did not include card handles. I decided to try having some 3D printed by Shapeways, using their "processed versatile plastic" process, which is a laser sintered nylon, color dyed and smoothed. I used a card handle model by Vince Slyngstad found here. The results were nice, sturdy, and dimensionally correct. The chosen "purple" color is a rather intense magenta in real life. Not exactly cheap for just a couple parts, but I had been wanting to try their print service.

Unibone: unassembled kit Unibone: 3d printed handles Unibone assembled

The Unibone has all sorts of capabilities, and proved itself very useful during the '11/34 restoration:

Ability to bus master to probe the Unibus address space and run diagnostics on memory found there. This was very useful for debugging the memory card that came with the -11/34 and sussing out its undocumented configuration switch settings.
Ability to directly load and execute MAINDEC diagnostics, without needing a functioning console emulator or storage subsystem. This is a convenient and speedier alternative to PDP11GUI.
Subsequently, the ability to emulate entire storage subsystems, very useful for loading and running full operating systems on this -11/34 which otherwise has no storage of its own.

The Unibone goes in a quad SPC slot; I opted for slot 9 on the -11/34, and this entailed removing the NPG jumper on the backplane there to allow the Unibone to bus master. The device worked well straight-away after assembly.

There are, alas, a couple small frustrations with the current design:

It is desirable to configure the Unibone and backplane to allow the Unibone to bus master and interrupt. However, this leaves grant chain(s) open at boot until the Unibone's own embedded software can boot and take control of the card (which takes on the order of a minute or so). During this time the host system is non-functional or may hang, and it needs to be subsequently reset (this reset can be scripted from the Unibone, but all of this does significantly increase end-to-end boot time of the machine). It would be nice if the Unibone had something like some normally-closed relays on the grant chains, to preserve grant continuity until control is actively assumed.
It would be desirable to be able to power the embedded BeagleBone in isolation, in place in a system, without having to have the entire host system powered at the same time (e.g. for maintenance of the Unibone's embedded software stack, maintenance of locally stored storage system media images, etc.) There is a relay on the Unibone which switches in Unibus power when available, but unfortunately, the design is such that if the BeagleBone is also externally powered the relay remains engaged when the host system is shut down. This could lead to the BeagleBone trying to power the entire Unibus via its 5V supply/connector, which could obviously be problematic... For now it seems best just to pull the card in order to run it in isolation, which is a little less than convenient.

That said, the designs and software are open source, and the card comes with some generous prototyping areas built right in, so some mods to address these issues could be a fun project. All in all, Jörg has put together a fantastically useful bit of kit, and I'm certainly glad to have it in my toolbox!

PDP-11/34

Tue 09 March 2021 by Fritz Mueller

This spring I worked on repair/restoration of a friend's PDP-11/34. The system was in fairly good shape, but missing a few bits and pieces and with the usual sorts of issues for 45-year-old kit. Started per usual with disassembly, cleaning, and inspection. The BA11-K chassis was in pretty decent shape; just a few scratches requiring some sanding and a little touch-up paint to inhibit future corrosion.

Date codes on the chassis and CPU cards are from 1976, but other components in the chassis are a bit of mix-and-match (a KY11-LB console interface and a third-party Monolithic Systems memory board date from 1981, and a DL11-W SLU/RTC card is from 1977). Serial number is 2001. There is also a sticker for "OHIO NUCLEAR", which was an early manufacturer of CT devices.

PDP-11/34: serial number tag PDP-11/34: BA11-K chassis sticker dating the system to 1976 PDP-11/34: partial sticker for

Foam problems here were limited to a decayed air pre-filter at the front of the chassis and some padding on the cable retaining bar at the rear. A heat gun and a paint scraper are your friends for removing the leftover cellophane adhesive strips that were used to secure the foam. For the replacement pre-filter, I opted for 3M Filtrete carbon pre-filter sheet (part FAPF-UCTF2PAMZ) which comes in sheets large enough to cover the front of the chassis and is easily cut to size with scissors.

The front panel brackets ended up being a bit of a puzzle to reassemble -- I unfortunately failed to pay close attention to how exactly the lower fasteners were configured during disassembly. Most of the wisdom out in the restoration community seems to pertain to a newer, and much more convenient, version of these brackets (or the ones that arrived on this system were mismatched?) Here's a picture of the brackets that I have, and a shot of the arrangement I finally settled on for the flange-blinded mounting hole at the bottom of the chassis: machine screws driven from the back of the bracket with Keps nuts toward the front. I also added some 1/8" nylon spacers so the pre-filter could be extended across the entire front of the chassis, behind the brackets, and everything still remains square when tightened up. A serviceable replacement power knob was tracked down here.

PDP-11/34: BA11-K front panel mounting bracket PDP-11/34: Front panel mounted with replacement air pre-filter in place behind brackets and replacement power knob

The BA11-K chassis has an integrated H765 power supply. The power-controller unit was in pretty good shape, but I replaced the line cord since the old one had some fairly serious nicks in its outer jacket. Also replaced cap C1 (50uF) which seemed to be drifting off value. Replaced the .1uF across-the-line caps mounted on the power transformer with modern X2 safety caps. The DC regulator modules (2x H744 +5V and 1x H745 -15V) were disassembled and cleaned. Reformed all the large electrolytics, then load tested the reassembled regulators individually. Nothing out of sorts here except the usual replacement of burnt out incandescent indicator bulbs.

PDP-11/34: H765 power controller module cleaned with new line cord PDP-11/34: H765 power supply main transformer, with modern X2 safety caps (orange) installed PDP-11/34: Load testing an H744 DC regulator module;

I filled out the system with a near-side M9301 bootstrap-terminator (recent eBay purchase), some G727 "knuckle buster" grant continuity cards, and an M9302 SACK turnaround far-side terminator. New on this restoration was a UniBone Linux-to-Unibus bridge, used to emulate storage devices among other things (more on this in a separate article soon). Checked/adjusted NPG continuity on the backplane (continuity wire wraps in place for all slots except slot 9, to accommodate the UniBone). Module utilization as follows:

	A	B	C	D	E	F
1	M7266 CPU control
2	M7265 CPU data path
3	M9301 boot term		M7859 console
4	Monolithic Systems 303-0158 64 KiB memory
5			M7856 serial / line clock
6				G727
7				G727
8				G727
9	M9302 SACK term		UniBone

Connected up a VT100 to the serial card, and fired things up. Good signs of life from the front panel, but the machine immediately halted without producing a boot monitor prompt. Was able to reset the machine from the front panel, though, and then examine and deposit various memory locations from there.

Boot ROM memory locations were readable, and the contents looked correct. RAM addresses were generally readable and writable, but bit 10 appeared incorrect (sometimes always set; sometimes always clear). I was also able to successfully write to the console XBUF register from the front panel and see characters appear on the VT100.

A bus init from the front panel followed by manually punching in the boot ROM entry point produced a functional ROM monitor on the VT100. Deposits and examines to RAM done from the boot monitor produced results identical to those seen using the front panel (same bit 10 problem).

One of the cool features of the KY11-LB console is a maintenance mode that can run Unibus bus cycles on its own without a CPU. This gives a way to do limited testing of cards in isolation: just set up the M7859 on a powered, terminated backplane segment and plug in cards to be tested one at a time. Deposits and examines can then be done using the buttons and display on the front panel.

Interestingly, when running this way with just the console and memory cards in place the bit 10 errors were no longer apparent. Some other card was apparently corrupting bit 10 on the bus; by checking one at a time the problem was quickly isolated to the M9301 boot terminator card.

The M9301 drives the implicated bit onto the Unibus via an 8881 bus driver at position E9, as seen below. The signal coming in from the bottom here is ENAB DATA H, which is meant to enable these drivers only when the M9301 detects a valid address decode. Verified that data was being incorrectly driven on BUS D10 L at E9 pin 13, regardless of the state of pin 12, indicating a faulty driver. Pulled this, socketed, and replaced (with a compatible ECG 7439), and the bit 10 problem was fixed.

Part of the data drivers for the M9301 bootstrap terminator

There was still some problem with auto-boot to the M9301 monitor, however; the monitor prompt would now begin to print at power up, but the machine would halt a few characters in. The front panel bus init plus manual jump to monitor entry point workaround was still working though, so put off further investigation of this issue until later.

At this point, given the workaround, the system was working well enough to begin loading and running MAINDEC diagnostics over the serial line with PDP11GUI. Relevant diagnostics, from the PDP-11/34 System User's Manual:

PDP 11/34 Diagnostics

DFKAA, DFKAB, and DFKAC all ran without issue. DFKTG, DZKMA, and DZQMC all reported various errors, so time to look into the memory board.

The board is a Monolithic Systems 303-0158:

PDP-11/34: Monolithic Systems memory card

I could not find any information on the board on the internet, but much can be figured out by inspection and testing. First, the board is using 4116 (16Kx1) DRAMs, pretty usual for the era. There is space for 4 banks of 18; each bank would be 16K words (16 data bits plus two parity bits per word.) Here we see two banks populated, minus one of the parity chips. So we'd expect to see 32K words (64 KiB) mappable (or 28K words [56 KiB] with address translation disabled, to accommodate the 4K word [8 KiB] I/O page.) The missing parity chip is unlikely to cause any trouble in this application; in an '11/34, there is no memory parity support without the optional M7850 parity board installed, and this system does not have one.

One of the capabilities of the Unibone is to probe the full 18-bit Unibus address space, looking for active pages. These tests indicated that the memory board as configured was responding to the lower 128 KiB of addresses, even though only 64 KiB was populated. One would suppose that the mapped address range was configured via the DIP switches on the board. Some experimentation with various switch settings yielded the following:

SW1: Memory start addr, 000000 + values as follows
7	6	5	4	3
0	.	.	.	.	400000
.	0	.	.	.	200000
.	.	0	.	.	100000
.	.	.	0	.	040000
.	.	.	.	0	020000

SW2: Memory size, 020000 + values as follows
4	3	2	1
0	.	.	.	200000
.	0	.	.	100000
.	.	0	.	040000
.	.	.	0	020000

After setting the switches appropriately for the amount of memory physically present, memory test errors went away and the MAINDEC memory diagnostics (excepting parity tests) also ran successfully.
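A minimal Python sketch of the switch tables above, assuming (as the tables suggest) that a switch shown as 0 contributes its listed value and any other position contributes nothing. The function and dictionary names are my own, for illustration only; they do not come from any Monolithic Systems documentation.

```python
# Octal values copied from the SW1/SW2 tables above; keys are switch numbers.
SW1_START_VALUES = {7: 0o400000, 6: 0o200000, 5: 0o100000, 4: 0o040000, 3: 0o020000}
SW2_SIZE_VALUES = {4: 0o200000, 3: 0o100000, 2: 0o040000, 1: 0o020000}
SW2_SIZE_BASE = 0o020000  # mapped size when no SW2 switch contributes


def start_address(zero_switches):
    """Start address in bytes, given the set of SW1 switches set to 0."""
    return sum(SW1_START_VALUES[s] for s in zero_switches)


def memory_size(zero_switches):
    """Mapped size in bytes, given the set of SW2 switches set to 0."""
    return SW2_SIZE_BASE + sum(SW2_SIZE_VALUES[s] for s in zero_switches)


def switches_for_size(size_bytes):
    """Set of SW2 switches to set to 0 for a given mapped size in bytes."""
    remainder = size_bytes - SW2_SIZE_BASE
    if remainder < 0 or remainder % 0o020000:
        raise ValueError("size not representable")
    chosen = set()
    # Walk from the largest weight down; the weights are binary, so greedy works.
    for switch, value in sorted(SW2_SIZE_VALUES.items(), reverse=True):
        if remainder >= value:
            chosen.add(switch)
            remainder -= value
    if remainder:
        raise ValueError("size not representable")
    return chosen
```

Under that reading, all four SW2 switches at 0 maps 128 KiB, which agrees with what the Unibone probe reported, and `switches_for_size(64 * 1024)` picks switches 3, 2, and 1 for the 64 KiB actually populated.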

So the last thing to fix was the problem with the boot monitor at startup. For this, the boot ROM card went back out on an extender so I could get at it with a logic analyzer.

A troublesome M9301 boot ROM card in a PDP-11/34, out on a card extender for troubleshooting. This one was acting strangely at power-up; the problem is actually visible in this picture...

A PDP-11 generates power down and power up traps, through location 024, based on transitions of the AC LO and DC LO Unibus signals. In handling this trap, the processor first reads the PC from location 024, then the PSW from location 026. Many PDP-11s had core memory or battery-backed RAM; this allowed for orderly recovery from power failure events.

Power up/down signals

PDP-11 boot ROM cards like the M9301 or its younger cousin the M9312 use a hack to obtain control at boot. They monitor AC LO and DC LO, and when detecting a boot condition they jam higher order address bits on the Unibus for the first couple of bus cycles. This causes the PC and PSW to be fetched from locations within the address space of the boot ROM card. Here is most of the circuitry responsible for this:

M9301 boot logic

The bus drivers that jam the address bus are seen on the right. The central player here is E21, a 9602 one-shot. CLEAR ADDR L is supposed to arrive after the first two bus cycles (fetch of PC and PSW) and release the bus; the one-shot is set up to time out after about 300ms and release the bus in any case.
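To make the address jam concrete, here is a toy model in Python. The jam mask of 773000 is my assumption about which high-order bits the card forces (it would redirect the 024/026 fetches to 773024/773026); the text above does not give the value, so treat it as illustrative only.

```python
JAM_MASK = 0o773000  # ASSUMED high-order bits forced onto the bus by the boot card


def bus_address(cpu_address, jam_active):
    """Address seen on the bus: the card ORs in its bits while the jam is active."""
    return (cpu_address | JAM_MASK) if jam_active else cpu_address


def boot_cycles(cpu_addresses, release_after):
    """Apply the jam to the first `release_after` bus cycles only."""
    return [bus_address(a, i < release_after) for i, a in enumerate(cpu_addresses)]
```

With `release_after=2` only the PC and PSW fetches are redirected, and a later access to, say, 001000 goes where the CPU intended. If the release never comes before the one-shot times out, that same access is dragged up into ROM space as well.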

On the logic analyzer, we can see an issue here:

M9301 at boot with bus release issue

MSYN delimits bus cycles mastered by the CPU. Here we can see that CLEAR ADDR L never arrives, and so the higher-order address bits remain jammed by the M9301 for the full duration of the one-shot timeout. This is okay for the first few instructions, which are executing out of the ROM anyway, but things quickly go awry...

Here is the circuitry responsible for CLEAR ADDR L:

M9301 bus release logic

The desired pulse is mediated by 270 uF capacitor C36 in one leg of gate E20, so this is a good thing to check first, and... it is actually missing from the board! (Visible in the M9301 gallery picture above.) Replaced this cap, and now we are in good shape:

M9301 at boot with bus release issue repaired

With this, the machine is fully repaired. Spent a little time with it, booting and running various operating systems from emulated storage on the Unibone card. Frieda also approves:

Frieda approved PDP-11/34

PDP-11/45: Reversing a vintage DL11 hack

Fri 27 November 2020 by Fritz Mueller

I recently had need to assess and repair several DL11 serial interfaces in my stock of spares. One of these had had some sort of end-user hack applied; in the course of putting the board back to factory condition, I did some analysis of the hack and its intended purpose, documented here.

DL11 with end-user hack DL11 user hack front DL11 user hack back

Easy enough to beep this out and reverse to a schematic:

Schematic of DL11 hack

So, the hack appears to dynamically alter the CSR address and interrupt vector of the card, choosing between two hard-wired presets, based on whether P1A/P1B are connected together or not.

The CSR jumpers on a stock DL11 operate with pull-ups upstream of the address decode logic, so these can be directly driven by the hack so long as the jumpers for the bits-to-be-hacked are left open on the board. The vector address bits, however, must be driven by the DL11 onto the Unibus contingent on an appropriate global enable. On a stock DL11, drivers for all configurable vector bits are activated by a single global enable, and jumpers downstream of the drivers control which of these activated bits will be admitted to the bus. So, for the vector address part of the hack to function, hack control must be asserted instead of the global enable for each of the to-be-driven bits, and the corresponding jumpers for these bits must be left in. And indeed, upon inspection of the DL11 there are trace cuts that have been done (marked here with "X") to lift the global enable and allow individual hack control of each of the affected bits:

Trace cuts for DL11 hack

Last, we can look at the board jumpering and the wiring of the hack to determine the specific CSR and vector addresses at play:

	A11	A10	A9	A8	A7	A6	A5	A4	A3	A2	A1	A0
P1 Open	1	1	0	1	0	1	0	0	1	0	0	0	776510
P1 Closed	1	1	1	1	0	1	1	1	0	0	0	0	777560

	V8	V7	V6	V5	V4	V3	V2	V1	V0
P1 Open	0	1	1	0	0	1	0	0	0	310
P1 Closed	0	0	0	1	1	0	0	0	0	060

We see from these specific addresses that closing the contacts of P1 would dynamically re-jumper the board from assignment as the 2nd non-console interface to assignment as the console interface. So perhaps this was once used (in conjunction with another similarly hacked interface?) to swap console terminals with the flip of a single switch.
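The two tables are easy to check mechanically. Here is a small Python sketch that packs the listed bit columns back into octal, assuming the six address bits above A11 are all ones (the helper name and that constant are mine):

```python
def bits_to_int(bits):
    """Pack a most-significant-first sequence of 0/1 values into an integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


IO_PAGE_HIGH_BITS = 0o770000  # ASSUMED: A17..A12 all ones

csr_open = IO_PAGE_HIGH_BITS | bits_to_int([1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0])
csr_closed = IO_PAGE_HIGH_BITS | bits_to_int([1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0])
vec_open = bits_to_int([0, 1, 1, 0, 0, 1, 0, 0, 0])
vec_closed = bits_to_int([0, 0, 0, 1, 1, 0, 0, 0, 0])

print(f"{csr_open:o} {vec_open:03o}")      # prints "776510 310"
print(f"{csr_closed:o} {vec_closed:03o}")  # prints "777560 060"
```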

PDP-11/45: Some more floating point trouble

Sat 21 November 2020 by Fritz Mueller

[A catch-up article, documenting events of April/May 2020]

In late April, I offered to give a video demonstration of the '11/45 to some interested work colleagues. Since I hadn't had it on in a while, I fired it up to make sure everything was still in working order. The machine behaved well from the front panel and was able to boot both V6 Unix and RSTS V06C. Great! Typed a very simple demo program into RSTS (print a multiplication table) and that ran, but produced some very strange results. Uh oh...

Asked RSTS to PRINT PI, and it spat out a value somewhere around 3.7... :-)

So, time to try the floating point MAINDECS... Sure enough, failures all over the place, starting with the very first diagnostic in the floating point suite, CFPAB0. This diagnostic covers utility operations like LDFPS/STFPS, SETI/SETL, SETF/SETD, etc.

I do not have listings for the diagnostics in this suite, but it is usually simple enough to reproduce failures with short toggle-in programs given the names and descriptions of the failing diagnostics. In this case, the following simple code to exercise an LDFPS/STFPS sequence from the front panel switches and lights showed that bits 10 and 11 of the floating point status/control word would come back erroneously toggled:

001000  170137  START:  LDFPS   @#177570        ;LOAD FPS FROM SWITCH REGISTER
        177570
001004  170237          STFPS   @#177570        ;AND STORE BACK TO DISPLAY REGISTER
        177570
001010  000773          BR      START           ;REPEAT
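When reading results back off the lights, the quickest way to see which bits misbehave is to XOR what was set on the switches with what came back. A sketch in Python; the function name is mine, and any values fed to it here are hypothetical rather than readings taken from the machine.

```python
def flipped_bits(written, read_back, width=16):
    """Return the bit positions at which the two words differ."""
    diff = (written ^ read_back) & ((1 << width) - 1)
    return [bit for bit in range(width) if diff & (1 << bit)]
```

For example, a word written as all zeros and read back as 006000 would report bits 10 and 11.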

First things first, check power to the FPU and its clock; these look fine. Next, plug the KM11 into the floating point slot and check the FPU microcode sequences while executing LDFPS and STFPS instructions. These also look fine:

For LDFPS @#177570 I see RDY.00, RDY.10, RDY.20, RDY.30, RDY.70, LD.50
For STFPS @#177570 I see RDY.00, RDY.10, RDY.20, RDY.30, RDY.80, STR.30, STR.08

Most of the data paths of interest regarding the FPS register are on the fraction low (FRL) board, so this goes out on extenders so the microcode can be stepped and gate-level logic inspected with a logic probe.

Here is the block diagram of data paths in the FPU, for reference in discussion below:

FP11-B data paths

So, one thing to note with regard to the FPS register is that it is gated through the ACMX multiplexer and written into scratch pad register AC7[0] during microcode state RDY.00 which is the first state in the common prolog of every FPU instruction:

FP11-B microcode prolog

Stopping in state RDY.00 and examining the ACMX inputs, selects, and outputs for bits 10 and 11 immediately reveals a problem. These bits of ACMX are implemented by a 74153 dual 4-input mux, E71 on sheet FRLB of the FP11-B engineering drawings:

FP11-B ACMX <11:10>

Inputs from the FPS register on pins 6 and 10 appear correct, as do the selector signals on pins 14 and 2. But outputs on pins 7 and 9 appear to be inverted. So E71 appears bad. Pulled this, socketed, and replaced. After this fix, LDFPS/STFPS function correctly in the toggle-in test program, and MAINDEC CFPAB0 passes.
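For reference, the expected behavior of one half of a 74153 can be modeled in a few lines of Python. This follows the standard part's truth table (active-low strobe, select lines shared by both halves). Which select code the FP11-B applies in RDY.00 is not stated here, so the example simply assumes the FPS inputs on pins 6 and 10 are the C0 inputs, chosen with both selects low.

```python
def mux_74153_half(inputs, select_b, select_a, strobe_n=0):
    """One half of a 74153: Y = C[2*B + A] when the active-low strobe is 0, else 0."""
    if strobe_n:
        return 0
    return inputs[2 * select_b + select_a]


def looks_inverted(selected_input, output):
    """True if an observed output is the complement of the selected input."""
    return output == 1 - selected_input
```

A healthy part should never satisfy `looks_inverted` for the selected input while the strobe is low; the 74153 has no inverting output.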

Not out of the woods yet, though... Progressing down the sequence of MAINDECS, diagnostic CFPDC0 (add/subtract) now fails :-( For this, we bring back the simple "add two floats" diagnostic used during previous FP11 debug:

        000000                          AC0=%0
        000001                          AC1=%1
000000                                  .ASECT
        001000                          .=1000
001000  170011                  START:  SETD                ;SET DOUBLE PRECISION MODE
001002  172467  000014                  LDD     D1,AC0      ;FETCH FIRST ADDEND FROM D1
001006  172567  000020                  LDD     D2,AC1      ;FETCH SECOND ADDEND FROM D2
001012  172100                          ADDD    AC0,AC1     ;ADD THEM (RESULT IN AC1)
001014  174167  000022                  STD     AC1,D3      ;STORE RESULT TO D3
001020  000000                          HALT
001022  040000  000000  000000  D1:     .WORD   040000,000000,000000,000000 ;0.5
001030  000000
001032  040000  000000  000000  D2:     .WORD   040000,000000,000000,000000 ;0.5
001040  000000
001042  000000  000000  000000  D3:     .WORD   000000,000000,000000,000000
001050  000000
        001000                          .END    START

Sure enough, this is producing incorrect results. The microcode flows for add/subtract/compare are a bit more involved than the simple load/store sequences above. The sequence starts with common prolog RDY.00, RDY.10, RDY.20, RDY.30, same as above. The first fork after RDY.30 goes to RDY.60, since add/subtract/compare are "no memory class" instructions (FP accumulator register operands only). The second fork after RDY.60 takes us to ADD.00 on sheet FP11 FLOWS 8.

The left side of FLOWS 8 is a decision tree for zero operands and/or whether or not we are executing a compare instruction. Traversal of these states sets up fraction and exponent operands and, if necessary, a comparison of operand exponents in the EALU. In our case (addition of two double-precision non-zero operands), the sequence is: ADD.00, ADD.04, ADD.06, ADD.02, ADD.08, ADD.12.

We then end up at state ADD.22 at the top of the right side of FLOWS 8. The previously set up exponent difference is used to index into a 256x4 "range ROM"; output bits from this ROM inform the subsequent microcode fork which determines which operand shift, if any, to apply before the upcoming fraction ALU operation.

FP11-B Exponent Comparison Flow

Here a problem is evident. We should fork to ADD.24, for equal exponents, but instead we end up at ADD.30, for destination exponent less than source exponent. Putting the FXP board out on the extender and pausing in this state, the operands and operation codes on the EALU bit-slices appear to be correct, but signal FRMH ALU CIN L is erroneously asserted at E34 pin 7 (sheet FXPA). This extra carry (borrow, really, since the operation is a subtract) into the least significant bit-slice causes the EALU output to be -1 instead of 0.
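The effect of the stray borrow can be sketched in Python. This is a behavioral model only, not the EALU's gate-level operation: it takes the 8-bit excess-128 exponent from the first word of each operand (bits 14:7 of a PDP-11 floating point value), subtracts source from destination, and optionally applies an extra borrow. The fork labels for the equal and less-than cases come from the text above; the function names and the catch-all third label are mine.

```python
def exponent(first_word):
    """Excess-128 exponent field, bits 14:7 of the first word of a PDP-11 float."""
    return (first_word >> 7) & 0o377


def exponent_difference(dest_word, src_word, spurious_borrow=False):
    """Destination exponent minus source exponent, less an optional stray borrow."""
    return exponent(dest_word) - exponent(src_word) - (1 if spurious_borrow else 0)


def fork(difference):
    """Map the difference to the microcode fork described in the text."""
    if difference == 0:
        return "ADD.24"  # equal exponents
    if difference < 0:
        return "ADD.30"  # destination exponent less than source exponent
    return "other"       # remaining cases are not covered by this sketch
```

Any two operands with equal exponents give a difference of 0 and the ADD.24 fork; with `spurious_borrow=True` the difference becomes -1 and the model takes ADD.30, as observed.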

Moving back to the source of this signal on the FRM board, it turns out that FRM E20, a 74H40 dual quad-input NAND, is outputting an invalid logic level at pin 8. Pulled this, socketed, replaced, and the problem appears to be fixed.

FP11-B FRMH ALU CIN L

After this second repair, the full suite of FP11-B diagnostics is passing again. And RSTS/E has a much less fanciful interpretation of PI...

PDP-11/45: V6 Unix Troubleshooting, Part II

Sun 25 October 2020 by Fritz Mueller

[A catch-up article, documenting discoveries of Feb 2019]

In early 2019, I made a V6 Unix pack from the Ken Wellsch tape image, as mentioned in this blog entry. It booted on my machine, but dumped core on the first ls in single-user mode, or as soon as I did any heavy lifting in multi-user mode.

The following is the conclusion of a chronology of the troubleshooting campaign that took place over the next month and a half, culminating in a hardware fix and successful operation of V6 Unix on the machine (part I is here.) This was largely a collaborative effort between Noel Chiappa and myself via direct email correspondence, though some help was received from others via the cctalk mailing list as well.

By this point, the nature of the ls problem had been fairly well characterized: part of the ls process address space ended up holding an incorrect portion of its program text; subsequently, when execution jumped to some of these unexpected bits, an out-of-bounds memory access would occur triggering a memory management trap. Efforts now focus on understanding how and why the bad bits got there...

February 7

[Here and below, block-quoted content is excerpted from email correspondence.]

Fritz:

Noel, is it possible for you to deduce where Unix should be placing these "bad" bits (from file offset octal 4220)? Maybe a comparison of addresses where the bits should be, with addresses where the "bad" copy ends up, could point us at some particular failure modes to check in the KT11, CPU, or RK11...

Noel:

Yes, it's quite simple: just add the virtual address in the code to the physical address of the bottom of the text segment (given in UISA0). The VA is actually 04200, though: the 04220 includes 020 to hold the a.out header at the start of the command file.

So, with UISA0 containing 01614, that gives us PA:161400 + 04200 = PA:165600, I think. And it wound up at PA:171600 - off by 04000 (higher) - which is obviously an interesting number.

Here's where it gets 'interesting'.

Executing a command with pure text on V6 is a very complicated process. The shell fork()s a copy of itself, and does an exec() system call to overlay the entire memory in the new process with a copy of the command (which sounds fairly simple, at a high level) - but the code path to do the exec() with a pure text is incredibly hairy, in detail. In particular, for a variety of reasons, the memory of the process can get swapped in and out several times during that. I apparently used to understand how this all worked, see this message:

https://minnie.tuhs.org/pipermail/tuhs/2018-February/014299.html

but it's so complicated it's going to take a while to really comprehend it again. (The little grey cells are aging too, sigh...)

The interesting point is that when V6 first copies the text in from the file holding the command (using readi(), Lions 6221 for anyone who's masochistic enough to try and actually follow this :-), it reads it in starting from the bottom, one disk block at a time (since in V6, files are not stored contiguously).

So, if it starts from the bottom, and copies the wrong thing from low in the file up to VA:010200, when it later gets to VA:010200 in the file contents, that should over-write the stuff that got put there in the wrong place earlier. Unless there's another problem which causes that later write to also go somewhere wrong...

So, I'm not sure when this trashage is happening, but because of the above, my guess is that it's in one of the two swap operations on the text (out, and then back in). (Although it might be interesting to look at PA:165600 and see what's actually there.) Unix does swapping of pure texts in a single, multi-block transfer (although not always as an integral number of blocks, as we found out the hard way with the QSIC :-).

So my suspicions have now switched back to the RK11... One way to proceed would be to stop the system after the pure text is first read in (say around Lions 4465), and look to see what the text looks like in main memory at that point. (This will require looking at KT11 registers to see where it's holding the text segment, first.)

If that all looks good, we'll have to figure out how to stop the system after the pure text is read back in (which does not happen in exec(), it's done by the normal system operation to swap in the text and data of a process which is ready to run).

We could also stop the system after the text is swapped out, and key in a short (~ a dozen words) program to read the text back in from the swap device, and examine it - although we'd have to grub around in the system a bit to figure out where it got written to. (It might be just easier to stop it at, say, Lions 5196 and look at the arguments on the kernel stack.)
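The address arithmetic from the start of Noel's message, as a short Python sketch. It assumes the offset stays within the single page mapped by UISA0, which holds for the values under discussion; the function name is mine.

```python
CLICK_BYTES = 0o100  # one click is 64 bytes, so a click count T maps to PA:T00 (octal)


def physical_address(base_clicks, virtual_address):
    """Base register value (in clicks) plus an offset within the segment."""
    return base_clicks * CLICK_BYTES + virtual_address


expected = physical_address(0o1614, 0o4200)  # where the bits should be
observed = 0o171600                          # where they wound up
print(f"expected PA:{expected:o}, off by {observed - expected:o}")
# prints "expected PA:165600, off by 4000"
```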

Fritz:

...it might be interesting to look at PA:165600 and see what's actually there

A sea of zeros, as it turns out.

The most valuable thing ... would be to look at the text segment, after it's read in and before it's swapped out. I can work out where to put a halt, if you want to try that.

Yes, this sounds like a good plan to me! Is this as simple as dropping a HALT at VA:0 in the text?

Noel:

No; actually, probably easier! :-) Probably easiest is to, just before you type 'ls', put a HALT in the OS just after 4467 in Lions. Halt the machine momentarily, patch the kernel, and CONT. (Basically the same as your patch to the trap vector, just a different address.) That'll be at 021320 (should contain 062706), physical or virtual. :-)

When the system halts, you'll need to look at the text in memory. Two ways to find the location: look on the kernel stack, the address should be the second thing down:
mov 16(r3),-(sp)
add $20,(sp)
mov (r4),-(sp)
jsr pc,*$_swap
(i.e. the thing that 020 got added to). Probably easier, though, is just to look in UISA0 (which at this point is pointing to the block of memory that's been allocated to read the text into, Lions 4459-60).

That number in UISA0, T, will be the click address of the text. So PA:T00 should be the start of the text (170011 010600, etc). So then PA:(T00+010200) should be the trashed chunk of text: 110024 010400 000167 000016 010500 etc (right) or 016162 004767 000224 000414 016700 (wrong).

February 8

Noel:

In addition to the info I already sent about how to [set the breakpoint], if you could note down the top 3 words on the kernel stack, and the contents of the RK registers, those would be really useful; the first will allow us to work out what should be in the RK registers after the swap I/O operation completes - I don't think the RK11 will be asked to do anything after that finishes and before the system hits that halt in xalloc().

To find the kernel stack.... read out KISA6, S. This value will point to the 'user' area of that process, plus the kernel stack. The kernel SP should be something like 01417xx; subtract 140000 (the segment number), and add what's left to S00. Alternatively, you can probably use the rotating switch on the front panel to just look up VA:1417xx (whatever's in R6) directly.

Oh, if you need some bed-time reading to put you to sleep, check out the bottom section ("exec() and pure-text images") in:

http://gunkies.org/wiki/Unix_V6_internals

which will explain what's going on here with the swapping in and out, which is sorta complicated.

February 9

Noel:

just halt the machine after the text is swapped in

The code we need is at Lions 2034, where the pure text of a process is swapped in (and this should only be traversed once; I don't think the system will need to swap in the text of the shell); just put a HALT in (in the usual manner, just before trying 'ls') at 015406, which should contain a 062706 (again).

At that point, since the text size is 010400, and the location of the text in physical memory is 0161400, the BAR should contain 0172000. If not, and it's 0232000 (note that the 0200000 bit will be in the CSR, the lower XM bit) instead, Bazinga!, it's nailed (unless the system somehow snuck another RK operation in there, but I don't see anything that could do that).
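Noel's prediction is simple to reproduce. The Python sketch below adds the text size to its physical base and splits the resulting 18-bit address into the 16 bits held in the bus address register and the two extended-memory bits held in the CSR. The function names are mine, and the sketch assumes the address register ends up pointing just past the last byte transferred.

```python
def split_bus_address(address_18bit):
    """Return (low 16 bits for the BAR, 2 extended-memory bits for the CSR)."""
    return address_18bit & 0o177777, (address_18bit >> 16) & 0o3


def expected_after_transfer(start, size_bytes):
    """Register contents expected once `size_bytes` have moved from `start`."""
    return split_bus_address(start + size_bytes)
```

`expected_after_transfer(0o161400, 0o10400)` gives a BAR of 172000 with both extended bits clear, while an 18-bit address of 0232000 splits into a BAR of 032000 with the lower extended bit set.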

I finally get some time back in front of the machine, after a few days in bed with a cold:

...put a HALT in the OS just after 4467 in Lions. Halt the machine momentarily, patch the kernel, and CONT. (Basically the same as your patch to the trap vector, just a different address.) That'll be at 021320 (should contain 062706)...

But alas, it does not. [PA:021320] = 010246. Furthermore, [PA:015406] = 016504.

I just tried under SIMH, also, and got consistent results:
[PA:015406] = 016504
[PA:021320] = 010246
...so, one would think, my rkunix and yours are different?

Noel:

That must be it. I thought we were both working from the V6 distribution? Oh, yours prints out that Western Electric copyright notice, I don't think mine has that...

February 10

The first part of the day is spent sorting out and comparing the "Wellsch" V6 distribution that I have been using, and the "Ritchie" version that Noel has been using. Noel comes to the conclusion that the only differences in the kernel sources are in fact the four printfs for the copyright notice, but this is enough to perturb the locations of various symbols of interest between the two kernels. He also finds the binaries ls, cc, as, as2, ld, c0, c1, and c2 all match; as do liba.a, libc.a and crt0.o.

Getting back on the trail of the bug:

So the first place I'd like to try HALTing is just after the call to swap, Lions 4467; at that point, the text should be in main memory, and also just written to disk. Should be at 021320 (old contents should be 062706).

Fun things to do here: look at the text in main memory (0161400 and up), see if it's correct at this point. Also: pull the arguments off the top of the stack, and write a small program to read it back in...

This turns out to be one last typo ("rkunix" vs. "rrkunix" on Noel's part) resulting in incorrect symbol addresses for my kernel, but I'm hip to Noel's curveballs now so:

Okay, using today's newly acquired 'db' skillz :-), in my rkunix, that spot is at PA:21420. Firing up the machine again and trying that now...

It works; I end up stopped at the breakpoint and start extracting data:

Hmmm:

PA:161400: 141644 141660 000000 000000 000000 000000 000000 000000
PA:161420: 000000 000000 000000 000000 000000 000000 000000 000000

Noel:

The text is probably at a different location in PA at this point. Read out UISA0 for its base.

Fritz:

UISA0: 001654

PA:165400: 170011 010600 011046 005720 010066 000002 004767 000010

KSP: 141656 -> PA:165256

PA:165256: 007656 001654 000104 000000 101602 066312 000000 141726
PA:175600: 110024 010400 000167 000016 010500 010605 101446 010346

So far so good -- both beginning and eventually-bogus sections of text check out at this point!

Noel:

Woo-Hoo!!!! YEAH!!!!

So that part of the text is right at this point.

Needless to say, this is very, very important data.

So chances are very strong, at this point, that it's the RK11.

What did you want to do next? You could start with the RK11 registers. Also, use PDP11GUI to read the copy off the swap device, once I decipher the stack?

PA:165256: 007656 001654 000104 000000 101602 066312 000000 141726
OK, so the 01654 is the start address in PA (in clicks) for the area to swap out, and that matches UISA0. 0104 is the text length (also in clicks), and that also matches. The 0 is a flag which says it's a write (read is 01). And the 07656 is the block number (4014.).
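The same decoding in Python, as a sketch. The argument order (block, base, length, flag) is read off the stack words as Noel interprets them above; the field names and the function name are mine.

```python
import math

CLICK_BYTES = 64
BLOCK_BYTES = 512


def decode_swap_args(stack_words):
    """Interpret the top four kernel stack words as the swap() arguments."""
    block, base_clicks, length_clicks, flag = stack_words[:4]
    length_bytes = length_clicks * CLICK_BYTES
    return {
        "block": block,                       # starting disk block
        "base_pa": base_clicks * CLICK_BYTES,  # physical address of the text
        "length_bytes": length_bytes,
        "blocks_touched": math.ceil(length_bytes / BLOCK_BYTES),
        "is_read": flag == 1,
    }
```

Fed the stack dump above, this reports block 4014 (decimal), a base of PA:165400, a length of 010400 bytes spread over 9 disk blocks (the last one only half used), and a write.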

Fritz:

I should have a valid swap on the disk from before I shut down... Going to fire up PDP11GUI and grab it now to have a look. We want blocks 4014-4022, then? (9 x 512-byte blocks = 0110 clicks if I got that right?)

Noel:

4014.-4023., I think...

(9 x 512-byte blocks = 0110 clicks if I got that right?)

I think 8-1/2 or so; text is 010400 bytes (a little less, actually, but that's what the system is using), 01000 bytes/block, = 010.4 blocks.
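
The click and block arithmetic in this exchange is easy to fumble in octal, so here is a minimal Python sketch of it. The names are illustrative only (they are not from the V6 sources); the values are the ones decoded from the stack above.

```python
# Sketch of the swap-size arithmetic; names are illustrative, not from V6 source.
CLICK_BYTES = 0o100      # one 'click' = 64 bytes, the PDP-11 MMU quantum
BLOCK_BYTES = 0o1000     # one disk block = 512 bytes

def swap_extent(length_clicks):
    """Return (byte count, blocks as a float, whole blocks touched)."""
    nbytes = length_clicks * CLICK_BYTES
    whole_blocks = -(-nbytes // BLOCK_BYTES)   # ceiling division
    return nbytes, nbytes / BLOCK_BYTES, whole_blocks

nbytes, blocks, whole = swap_extent(0o104)     # text length from the stack
print(oct(nbytes), blocks, whole)              # 0o10400 8.5 9
print(0o7656)                                  # 4014, the decimal block number
```

So the text occupies eight and a half blocks, which means nine 512-byte blocks are touched on disk.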

Fritz:

Hmm, the beginning looks good, but it seems to cut off too soon:

0000000    000000  000000  000000  000000  000000  000000  000000  000000
*
7656000    170011  010600  011046  005720  010066  000002  004767  000010
7656020    010016  004737  006374  104401  004567  010154  162706  000044
7656040    012716  000001  004737  004652  010067  022314  010516  062716
7656060    177762  004737  006346  016500  177762  062700  177413  010067
   |
7660320    000137  002346  016516  000004  012746  020452  004737  003562
7660340    005726  000137  002542  005067  017552  012704  022336  005003
7660360    012716  021050  004737  005042  110024  005203  022703  000020
7660400    000000  000000  000000  000000  000000  000000  000000  000000
*
11410000

Noel:

7656000    170011  010600  011046  005720  010066  000002  004767  000010
Yup, good start; SETD, etc.
7660360    012716  021050  004737  005042  110024  005203  022703  000020
7660400    000000  000000  000000  000000  000000  000000  000000  000000
Hunh; not good. (Might be worth looking at that location in main memory, see if it's zeros or not.)

That's so odd that it's all zeros - I wonder where they came from? Maybe they were already on the disk, and the write stopped way early? (At 01000 bytes per block, it stopped after 2-1/2 blocks; 056000s, 057000s, stopped half-way through the 060000's.)

Would be useful to have the RK register contents after the swap() call returns...

Fritz:

Okay, the write should be from PA:165400 - PA:175777, to sectors 07656 - 07667. Block 7667 encodes to an RKDA value of 012363.
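
The block-to-RKDA encoding can be checked with a short Python sketch. It assumes the standard RK05 geometry (12 sectors per track, 2 surfaces per cylinder) and the usual RKDA field layout; the function name is hypothetical.

```python
# Sketch: encode an RK05 logical block number into an RK11 RKDA value.
# Assumed geometry: 12 sectors per track, 2 surfaces per cylinder.
SECTORS_PER_TRACK = 12

def rkda_for_block(block, drive=0):
    track, sector = divmod(block, SECTORS_PER_TRACK)
    cylinder, surface = divmod(track, 2)
    # Assumed RKDA layout: drive in bits 15-13, cylinder in 12-5,
    # surface in bit 4, sector in bits 3-0.
    return (drive << 13) | (cylinder << 5) | (surface << 4) | sector

print(f"{rkda_for_block(0o7667):06o}")   # 012363
```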

After the halt, I find:
RKDS: 004707 (OK)
RKER: 000000 (OK)
RKCS: 000322 (BOGUS! EX.MEM = 01)
RKWC: 000000 (OK)
RKBA: 176000 (OK)
RKDA: 012363 (OK)
So, EX.MEM are the smoking bits here! I will review the associated designs and come up with things to try/check.
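
For readers who do not have the RKCS layout memorized, here is a small Python sketch that pulls the fields out of the value read above. The field positions are my understanding of the RK11 control/status register and should be treated as an assumption; the function name is hypothetical.

```python
# Sketch: split an RK11 RKCS value into its low-order fields.
# Assumed layout: bit 0 GO, bits 3-1 function, bits 5-4 EX MEM,
# bit 6 interrupt enable, bit 7 controller ready.
def decode_rkcs(rkcs):
    return {
        "go":       rkcs & 1,
        "function": (rkcs >> 1) & 0o7,   # 1 = write
        "ex_mem":   (rkcs >> 4) & 0o3,   # bus address bits 17-16
        "ide":      (rkcs >> 6) & 1,
        "rdy":      (rkcs >> 7) & 1,
    }

fields = decode_rkcs(0o322)
print(fields["function"], fields["ex_mem"])   # 1 1
```

A write on a machine with less than 32K words of memory should finish with EX MEM still at 00, so a value of 01 here means the address counter carried out of bit 15 at some point during the transfer.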

Okay, taking a look:

RKBA is implemented in the M795 module in slots AB07, as detailed on sheet RK11-C-15. The M795 is a generic WC/BA Unibus interfacing module. The BA part only covers 16 bits, but generates an overflow out "D15 RKBA=ALL 1 L".

EX MEM 01 and EX MEM 02 are maintained on the M239 module in slot A17, as detailed on sheet RK11-C-03. The M239 is a 3x 4-bit counter/register module, so this also implements counting up these bits, when triggered by "D15 RKBA = ALL 1 L".

Based on where we see the data on disk fall off (offset 2400) and the start PA (165400), I'm guessing we get a false trigger on this "ALL 1" at RKBA 167777. So that looks like a false "1" detect on RKBA bit 12.
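
The reasoning behind the bit 12 guess can be spelled out in a few lines of Python. This is only a sketch of the arithmetic, following the text above in treating 167777 as the RKBA value just before the false carry.

```python
# Sketch of the address arithmetic behind the "bit 12" guess.
START_PA = 0o165400      # start of the text being swapped out
DISK_OFFSET = 0o2400     # bytes on disk before the zeros begin

fail_addr = START_PA + DISK_OFFSET
last_good = fail_addr - 1
print(f"{fail_addr:06o} {last_good:06o}")      # 170000 167777

# Which bits of that address are zero? An all-ones detector would
# need exactly these bits to read falsely as 1 in order to fire.
zero_bits = [b for b in range(16) if not (last_good >> b) & 1]
print(zero_bits)                                # [12]
```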

So I think the thing to do is to put the M795 out on an extender, load RKBA with 167777, and have a check at E28 pin 5, and E34 pin 8!

And we leave the cliffhanger there, for now, at least until tomorrow evening. Because due to the way the RK11-C is mounted, in order to do the above I'm going to have to spin the whole machine around (it's a dual H960), extend the RK05's so there is room to physically climb in the back, rig a work light, and get on in there...

February 11

SUCCESS!!

Put the M795 out on an extender, loaded 167777 in RKBAR, and had a look around with a logic probe. Narrowed it down to E34 (a 7430 8-input NAND). Pulled, socketed, replaced, and off she goes!

I can now successfully boot and run both V6 Unix and RSTS/E V06C from disk.

THAT was a really fun and rewarding hunt :-) First message in the thread was back on Dec 30, 2018. Lots of debugging and DRAM repairs, then the final long assault to this single, failed gate...

Thanks to all here for the help and resources, and particular shout-outs for Noel and Paul who gave generously of their time and attention working through the densest bits, both on and off the list.

I predict a long happy weekend and a big power bill at the end of the month :-)

M795 WC/BA Module

M795 module and the single failed gate

PDP-11/45: V6 Unix Troubleshooting

Sat 24 October 2020 by Fritz Mueller

[A catch-up article, documenting discoveries of Jan/Feb 2019]

The following is the first part of a chronology of the troubleshooting campaign that took place over the next month and a half, culminating in a smoking gun hardware fix and successful operation of V6 Unix on the machine. This was largely a collaborative effort between Noel Chiappa and myself via direct email correspondence, though help was received from others via the cctalk mailing list as well.

January 8-9

Initial experiments. Described the ls crashes to Noel. He theorizes that the reason ls works in one case and crashes in another is that it lands in a different spot in memory in each case.

Luckily, a subsequent od on the core file does not crash, and a core file is successfully extracted:

140004 000000 141710 141724 
$DK
@rkunix
mem = 1035
RESTRICTED RIGHTS

Use, duplication or disclosure is subject to
restrictions stated in Contract with Western
Electric Company, Inc.
# LS
MEMORY FAULT -- CORE DUMPED
# OD CORE
0000000 141552 141562 000000 000000 000000 000000 000000 000000
0000020 000000
0000060 000000 000000 000000 000001 000000 000000 063260 140076
0000100 001700 000000 000104 066112 067543 062562 000000 000000
0000120 000000 000000 000000 060221 000567 067543 062562 000000
0000140 000000 000000 000000 000000 066112 000000 000020 000000
0000160 000000 000000 000000 000000 177701 000000 000020 000000
0000200 000000 000000 000000 000000 177701 041402 016006 000000
0000220 000000 000000 000000 000000 066016 041402 016006 000000
0000240 000000 000000 000000 000000 066016 075120 075120 075120
0000260 000000
0000300 000000 000000 000000 000000 000013 010400 001050 002366
0000320 000000 000104 000035 000024 000000 141732 141742 141664
0000340 141674 000000 000000 000000 000000 000000 000000 000000
0000360 000000
0000400 000000 000000 000000 000000 000000 000000 000012 000000
0000420 000000 000000 000000 141772 000000 000000 000000 000000
0000440 000000
0001500 000000 025334 003602 001236 025334 003602 002454 003602
0001520 063260 177716 000000 141542 016070 001176 000000 003602
0001540 063260 177716 000000 141562 016070 001176 066352 030300
0001560 063260 025334 003602 077572 000013 107564 141626 000512
0001600 000000 141604 141616 000300 074616 025334 003602 000217
0001620 000203 107404 020276 000512 000000 141634 141640 003602
0001640 000007 000135 107454 141662 014314 003602 066352 005674
0001660 000000 141712 013640 074616 000000 001000 000000 000000
0001700 001000 074616 063260 066352 000013 141726 023730 066352
0001720 063260 000000 000013 141742 023502 003602 000000 177760
0001740 000013 141756 022050 000013 000000 000000 000000 000034
0001760 000444 000031 177760 000000 030351 177770 010210 170010
0002000 000001 177777 177777 023436 023436 020264 000162 000262
0002020 000262 000202 000262 000256 000210 000262 000250 000262
0002040 000262 000216 000262 000262 000262 000262 000262 000224
0002060 000170 000234 000242 000003 100000 000144 040000 000142
0002100 020000 000143 000055 000001 000400 000162 000055 000001
0002120 000200 000167 000055 000002 004000 000163 000100 000170
0002140 000055 000001 000040 000162 000055 000001 000020 000167
0002160 000055 000002 002000 000163 000010 000170 000055 000001
0002200 000004 000162 000055 000001 000002 000167 000055 000001
0002220 000001 000170 000055 000001 010000 000164 000040 020066
0002240 020106 020116 020126 020142 020152 020162 020176 020206
0002260 020216 020226 000056 062457 061564 070057 071541 073563
0002300 000144 062457 061564 063457 067562 070165 005000 071445
0002320 005072 072000 072157 066141 022440 005144 022400 062065
0002340 000040 031045 020144 022400 033055 033056 000163 026445
0002360 062066 022400 062063 022454 062063 022400 071467 020000
0002400 026445 027067 071467 022440 032055 032056 020163 020000
0002420 026445 031061 030456 071462 000040 032045 020144 022400
0002440 005163 022400 030456 071464 000012 071445 072440 071156
0002460 060545 060544 066142 005145 022400 020163 067556 020164
0002500 067546 067165 005144 000000 003750 000144 004076 000157
0002520 004070 000170 004172 000146 004210 000145 004026 000143
0002540 004044 000163 003764 000154 004226 000162 000000 000000
0002560 177774 177760 177775 177770 104404 022376 000000 104405
0002600 000000 000000 104403 000000 001000 104405 000000 000000
0002620 104421 000000 023436 104423 000000 000000 104422 000000
0002640 000000 000037 000034 000037 000036 000037 000036 000037
0002660 000037 000036 000037 000036 000037 043120 020712 020716
0002700 000001 000005 000515 000072 000457 051505 000124 042105
0002720 000124 060504 020171 067515 020156 030060 030040 035060
0002740 030060 030072 020060 034461 030060 000012 072523 046556
0002760 067157 072524 053545 062145 064124 043165 064562 060523
0003000 000164 060512 043156 061145 060515 040562 071160 060515
0003020 045171 067165 072512 040554 063565 062523 047560 072143
0003040 067516 042166 061545 000000 000000 000000 000000 000000
0003060 000000
0010060 000000 000020 000001 177770 177774 177777 071554 000000
0010100 
#

Noel prepares to analyze the core file (block quotes here and further below taken from email correspondence):

I just checked, and the binary for the 'ls' command is what's called 'pure code'; i.e. the instructions are in a separate (potentially shared) block of memory from the process' data (un-shared).

On another front, that error message ("Memory error") is produced when a process gets a 'memory management trap' (trap to 0250). This could be caused by any number of things (it's a pity we don't know the contents of SR0 when the trap happened, that would tell us exactly what the cause was).

[Memory management registers in the core dump] are 'prototypes', later modified for actual use by adding in the actual address in main memory. Still trying to understand how that works - the code (in sureg() in main.c) is kind of obscure.

January 10-24

Further communication with Noel and the cctalk list raises some suspicion about the memory in my machine. Though I had done spot checks and repairs on this in the past, which had been sufficient to pass most MAINDEC diagnostics and to boot and run RT11, in fact the memory had not yet been exhaustively tested.

Over the course of some days, memory test codes are developed and run, and several additional failed DRAMs in the MS11 memory system are isolated and repaired. These efforts have previously been reported in detail in this blog entry.

After these repairs, the MAINDEC MS11 memory diagnostics and KT11-C MMU diagnostics, both of which are beastly and exhaustive, are found to pass robustly with one caveat: memory parity tests. A deep-dive into the design and implementation of memory parity on the PDP-11/45 follows. At the end it is concluded that the machine, a very early serial no. in its line, is in fact functioning per-design. These efforts are documented in this blog entry.

Even though the memory system looks solid after this, the V6 Unix crash behavior remains exactly the same...

January 27-29

With the KT11 and memory now verified, Noel takes up the core dump again:

The problem is that Unix does not save enough info in the core dump for me to thoroughly diagnose the MM fault; e.g. 'ls' is a 'pure text' program/command, and the code's not included in the core dump (in normal operation, there's no need/use for it), so I don't have the code that was running at the time, just the data and swappable per-process kernel data - which is not all the per-process data, e.g. it doesn't include the location of the process's code and data segments in main memory.

Also, I'll look at the V6 code that sets up the KT11 registers to make sure I understand what it's doing. (The dump contains the 'prototype' for those contents, but the values are modified, by adding the actual memory location, before being stored in the KT11.)

I did find out that the PC at the time of the segmentation fault was 010210, which I thought looked awfully big (so I was wondering if somehow it went crazy), but in fact the text size is 010400, so it's just inside the pure text.

We agree to use Lions as a common reference point for detailed discussion of the loading and running of "ls" and what may be seen in the core dump.

January 30

Noel:

So, a bit more from my examination of the swappable per-process kernel data (the 'user' structure - not sure how much of a Unix internals person you are).

It gives the following for the text, data and stack sizes:
tsize 000104
dsize 000035
ssize 000024
which seems reasonable/correct, because looking at the header for 'ls' we see:
000410 010400 001050 002366 000000 000000 000000 000001
'0410' says it's pure text, non-split; the 010400 is the text size, which matches (those sizes above are in 'clicks', i.e. the 0100 byte quantum used in the PDP-11 memory management).

The data size also appears to be correct:
001050 (initialized)
002366 (BSS)
------
003436
which again matches (round up and divide by 0100).
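
The size check above is just a round-up division by the click size. A minimal Python sketch, with a hypothetical helper name and the header values copied from the 'ls' binary:

```python
# Sketch: convert the a.out header sizes of 'ls' into clicks.
CLICK = 0o100                     # 64-byte MMU quantum

def clicks(nbytes):
    return -(-nbytes // CLICK)    # round up to whole clicks

text, data, bss = 0o10400, 0o1050, 0o2366
print(f"{clicks(text):o}")        # 104  (tsize)
print(f"{data + bss:o}")          # 3436
print(f"{clicks(data + bss):o}")  # 35   (dsize)
```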

I have yet to dig around through the system sources and see what the initial stack allocation is, to see if that's reasonable (of course, it may have been extended during execution).

And here are the 'prototype' segmentation register contents:
UISA 000000 000020 000000 000000 000000 000000 000000 177701
UDSA 000000 000020 000000 000000 000000 000000 000000 177701
UISD 041402 016006 000000 000000 000000 000000 000000 066016
UDSD 041402 016006 000000 000000 000000 000000 000000 066016
Since it's not split, the D-space ones are clones of the I-space (which is what the code does - I don't think it turns user D off and on, depending on what the process has: I'd have made context switching faster by not having to set up the D-space registers for non-split processes, but I guess the extra overhead is pretty minimal).

I have yet to check all the contents to make sure they look good, but the U?SA registers look OK; the '020' is for the data, and that's kept contiguous with the 'user' area, so the '020' is to offset past that.
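
As a cross-check on the descriptor prototypes, the following Python sketch decodes them. It assumes the usual KT11 page descriptor layout (page length field in bits 14-8, expansion direction in bit 3, access control in bits 2-0); the function name is hypothetical.

```python
# Sketch: decode a KT11 page descriptor register (PDR) value.
# Assumed layout: PLF in bits 14-8, ED in bit 3, ACF in bits 2-0.
def decode_pdr(pdr):
    plf = (pdr >> 8) & 0o177
    expand_down = bool(pdr & 0o10)
    acf = pdr & 0o7                  # 2 = read-only, 6 = read/write
    # Upward pages span clicks 0..PLF; downward pages span PLF..0177.
    length = (0o200 - plf) if expand_down else (plf + 1)
    return plf, expand_down, acf, length

for pdr in (0o41402, 0o16006, 0o66016):
    plf, ed, acf, n = decode_pdr(pdr)
    print(f"{pdr:06o}: PLF={plf:o} ED={ed} ACF={acf} length={n:o} clicks")
```

Under those assumptions the three lengths come out as 0104, 035 and 024 clicks, which agree with tsize, dsize and ssize: read-only text, read/write data, and a read/write stack that grows downward.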

The PC at fault time of 010210 seems to point to the following code (assuming what was in main memory was actually the same as the binary on the disk):
        mov r4,r0
        jmp 10226
210:    mov r5,r0 
        mov sp,r5
We don't have SSR2, which points to the failing instruction, and I forget whether the saved PC on an MMU fault points to the failing instruction, or the next one; I'm going to assume the latter.

But either way, this is very puzzling, because I don't see an instruction there that could have gotten an MMU fault! The jump is to a location within the text segment (albeit at the end), and everything else is just register-register moves!

And how could the fault depend on the location in main memory?!?!

If you want to poke around in the core dump yourself, to verify that I haven't made a mistake, see this page:

http://gunkies.org/wiki/Unix_V6_dump_analysis

which gives useful offsets. (The ones in the user table I verified by writing a short program which did things like 'printf("%o", &0->u_uisa)', and the data at those locations looks like what should be there, so I'm pretty sure that table is good. For the other one, core(5) (in the V6 man pages) gives the register offsets (albeit in a different form), so you can check that I worked them out correctly.)

Two things you could try to get rid of potential pattern sensitivities: before doing the 'ls', say 'sleep 360 &' first; that running in the background should cause the 'ls' to be loaded and run from a different address in main memory. The other thing you could try is 'cp /bin/ls xls' and then 'xls', to load the command from a different disk location. (Both of these assume that you don't get another fault, of course!)

[Initial stack size] is 20. clicks, which is what it still is (024 clicks) in the process core dump, so the stack has not been extended. So any MM fault you see after starting 'ls' will probably be the one that's causing the process to blow out.

I tried to re-create that exact version of the 'ls' binary, because the one in the distro is stripped, and I wanted one with symbols to look at. I failed, because a library routine (for dates) has changed on my machine, see here:

http://www.chiappa.net/~jnc/tech/V6Unix.html#Issues

However, I did verify that the binary for ls.o is identical to what I can produce (using the -O flag). It's just that library routine which is different. I don't think it's worth backing out my library; I did manage to hand-produce a stub of the symbol table for where the error is happening in the old 'ls' binary:
010210T csv
010226T cret
010244T cerror
010262T _ldiv
010304T _lrem
010324T _dpadd
The fault does indeed seem to be happening at either the last instruction in the previous routine (ct_year, in ctime.c), or the first of csv.

(I should explain that PDP-11 C uses two small chunks of code, CSV and CRET, to construct and take down stack frames on procedure entry and exit. So on exit from any C procedure, the last instruction is always a PC-relative jump to CRET.)

It looks like that's what's blowing up - but it apparently works with the command at a different location in main memory! So it pretty much has to be a pattern sensitivity.

However, I think the KT11 does the bounds checking before it does the relocation - the bounds checking is done on virtual, un-relocated addresses. So that part of it should be the same for both locations! So here's my analysis:

Is it actually an indexed jump that's blowing up? I've been looking at the command binary, but that might not be what's in main memory. Or the CPU might be looking somewhere else (because of a KT error). (If we don't find the problem soon, we might want to put in that breakpoint so we can look in main memory and see what inst is actually at the location where SSR2 says the failing inst was; that can rule out a whole bunch of potential causes in one go - e.g. RK11 errors.)

If it is actually that jump that's failing - how? The PC hasn't been updated yet, so it can't be the fetch of the next instruction that's failing. Is the fetch of the index word producing the MM fault?

Fritz:

It occurs to me that we don't even really know if the fault occurs from the same address every time, since we have a core sample size of 1; I should duplicate the fail and extract another core file to compare.

Another thing I thought I might try tonight: deposit a trap catcher in the memory mgmt trap location from the front panel, just before issuing the 'ls' command. I can then check the PSW, PC, SP, and KT11 regs right at the time of fault.

Experiments begin from the front panel, and continue on into the early hours, producing:

Core #2:

140004 000000 141710 141724
$DK
@rkunix
mem = 1035
RESTRICTED RIGHTS

Use, duplication or disclosure is subject to
restrictions stated in Contract with Western
Electric Company, Inc.
# RM CORE
# LS
MEMORY FAULT -- CORE DUMPED
# OD CORE
0000000 141552 141562 000000 000000 000000 000000 000000 000000
0000020 000000
0000060 000000 000000 000000 000001 000000 000000 063260 140076
0000100 001700 000000 000104 066112 067543 062562 000000 000000
0000120 000000 000000 000000 060221 000571 067543 062562 000000
0000140 000000 000000 000000 000000 066112 000000 000020 000000
0000160 000000 000000 000000 000000 177701 000000 000020 000000
0000200 000000 000000 000000 000000 177701 041402 016006 000000
0000220 000000 000000 000000 000000 066016 041402 016006 000000
0000240 000000 000000 000000 000000 066016 075120 075120 075120
0000260 000000
0000300 000000 000000 000000 000000 000013 010400 001050 002366
0000320 000000 000104 000035 000024 000000 141732 141742 141664
0000340 141674 000000 000000 000000 000000 000000 000000 000000
0000360 000000
0000400 000000 000000 000000 000000 000000 000000 000011 000000
0000420 000000 000000 000000 141772 000000 000000 000000 000000
0000440 000000
0001500 000000 000000 000000 000000 000000 000000 000000 003602
0001520 063260 177716 000000 141542 016070 001176 000000 003602
0001540 063260 177716 000000 141562 016070 001176 066352 030300
0001560 063260 141576 000005 003602 066352 001612 074376 044516
0001600 003602 025334 003602 000000 000443 107144 141646 000512
0001620 000000 141624 141640 000300 020276 020356 030000 003602
0001640 000007 000135 107454 141662 014314 003602 066352 004404
0001660 000000 141712 013640 074616 000000 001000 000000 000000
0001700 001000 074616 063260 066352 000013 141726 023730 066352
0001720 063260 000000 000013 141742 023502 003602 000000 177760
0001740 000013 141756 022050 000013 000000 000000 000000 000034
0001760 000444 000031 177760 000000 030351 177770 010210 170010
0002000 000001 177777 177777 023436 023436 020264 000162 000262
0002020 000262 000202 000262 000256 000210 000262 000250 000262
0002040 000262 000216 000262 000262 000262 000262 000262 000224
0002060 000170 000234 000242 000003 100000 000144 040000 000142
0002100 020000 000143 000055 000001 000400 000162 000055 000001
0002120 000200 000167 000055 000002 004000 000163 000100 000170
0002140 000055 000001 000040 000162 000055 000001 000020 000167
0002160 000055 000002 002000 000163 000010 000170 000055 000001
0002200 000004 000162 000055 000001 000002 000167 000055 000001
0002220 000001 000170 000055 000001 010000 000164 000040 020066
0002240 020106 020116 020126 020142 020152 020162 020176 020206
0002260 020216 020226 000056 062457 061564 070057 071541 073563
0002300 000144 062457 061564 063457 067562 070165 005000 071445
0002320 005072 072000 072157 066141 022440 005144 022400 062065
0002340 000040 031045 020144 022400 033055 033056 000163 026445
0002360 062066 022400 062063 022454 062063 022400 071467 020000
0002400 026445 027067 071467 022440 032055 032056 020163 020000
0002420 026445 031061 030456 071462 000040 032045 020144 022400
0002440 005163 022400 030456 071464 000012 071445 072440 071156
0002460 060545 060544 066142 005145 022400 020163 067556 020164
0002500 067546 067165 005144 000000 003750 000144 004076 000157
0002520 004070 000170 004172 000146 004210 000145 004026 000143
0002540 004044 000163 003764 000154 004226 000162 000000 000000
0002560 177774 177760 177775 177770 104404 022376 000000 104405
0002600 000000 000000 104403 000000 001000 104405 000000 000000
0002620 104421 000000 023436 104423 000000 000000 104422 000000
0002640 000000 000037 000034 000037 000036 000037 000036 000037
0002660 000037 000036 000037 000036 000037 043120 020712 020716
0002700 000001 000005 000515 000072 000457 051505 000124 042105
0002720 000124 060504 020171 067515 020156 030060 030040 035060
0002740 030060 030072 020060 034461 030060 000012 072523 046556
0002760 067157 072524 053545 062145 064124 043165 064562 060523
0003000 000164 060512 043156 061145 060515 040562 071160 060515
0003020 045171 067165 072512 040554 063565 062523 047560 072143
0003040 067516 042166 061545 000000 000000 000000 000000 000000
0003060 000000
0010060 000000 000020 000001 177770 177774 177777 071554 000000
0010100
#

and also:

'db' works
'cp' works
'rm' works

'sleep 360 &' followed by 'ls' works, and then when the 'sleep' ends no longer works! So confirmation about memory location dependence.

'cp /bin/ls xls' followed by 'xls' does not work (dumps core); works with 'sleep' as with 'ls' above.

Okay, last experiment, booting up, then depositing trap catcher from the front panel into vector 250:
000250: 000252
000252: 000000
...then issuing the 'ls' seems to catch it. I can then examine registers and memory etc. from the front panel. This is a quick and easy repro. I went ahead and dumped a few of the KT11 registers (but it's late, so I can't guarantee I didn't slip up -- should try this again when I'm fresh):
SR0: 040143 (ah! page length fault, user I-space, page 1)
SR1: 000000 (no auto inc/dec to clean up)
SR2: 010210 (virtual PC, agrees with your deduction from core dump)
SR3: 000000 (that's odd -- shouldn't split I/D be enabled?)

UIPDR: 041402 016006 000000 000000 000000 000000 000000 066116
UIPAR: 001614 001760 001614 001614 001614 001614 001614 001614

UDPDR: 010501 057517 077717 077717 037611 067616 076300 064317
UDPAR: 002417 002564 007777 007766 005635 005656 007777  oops
...where "oops" means I thought I was done scribbling all these down, and turned off the machine. Did I mention it's late? :-)
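
For reference, here is a hedged Python sketch of how the SR0 value above breaks down. The bit positions are my reading of the KT11-C status register and should be taken as an assumption; the function name is hypothetical.

```python
# Sketch: decode the abort-related fields of KT11-C SR0.
# Assumed layout: bit 15 non-resident, bit 14 page length, bit 13
# read-only, bits 6-5 mode, bit 4 I/D space, bits 3-1 page, bit 0 enable.
def decode_sr0(sr0):
    return {
        "nonresident": bool(sr0 & 0o100000),
        "page_length": bool(sr0 & 0o040000),
        "read_only":   bool(sr0 & 0o020000),
        "mode":        (sr0 >> 5) & 0o3,     # 0 = kernel, 3 = user
        "d_space":     bool(sr0 & 0o20),     # False = I space
        "page":        (sr0 >> 1) & 0o7,
        "enabled":     bool(sr0 & 1),
    }

print(decode_sr0(0o40143))
```

With that layout, 040143 decodes as a page length abort in user mode, I space, page 1, with memory management enabled.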

[Note: It was late, and there is an error with UIPAR7 in this transcription. This will be the source of some uncertainty until corrected on February 2.]

January 31

Noel:

'sleep 360 &' followed by 'ls' works, and then when the 'sleep' ends no longer works! So confirmation about memory location dependence.

Yeah, that's a really important data-point. The fact that it is physical location dependent really does tend to implicate the KT11; I think the KB11 mostly only knows/has virtual addresses? (So I probably shouldn't bang my head trying to think of failure modes in the KB11?) If you have the source for its diag, you might try looking through it, looking for things it doesn't try...

Although I suppose it could be a location-dependent issue with the RK11. I should explain how to find, and examine the pure-text for the 'ls' command; if you halt the CPU on the trap again, look at UISA0, and that should give you the 'click' where the text starts; at that point I'd probably examine every 256th (block size) word and we can compare them to the original to make sure the in-core copy is OK.

SR0: 040143 (ah! page length fault, user I-space, page 1)
SR2: 010210 (virtual PC, agrees with your deduction from core dump)
If it's really 010210, I wonder how it could be a fault on page 1; each page (segment, really) of virtual address space is 020000 long, so that address is well inside page 0?

Unless it has fetched some other instruction, due to some other error, one which does try and do something on page 1... Might want to try looking at a few instructions around 010210 when you try this again, see what's actually there. Let's see, code starts at 0161400 in real memory (per UIPAR0 below), so 010210 is at 0171610... Maybe dump a few words from 171600 on?
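
The two address calculations in this reply can be sketched in Python as follows. Helper names are hypothetical; the PAR value is the UIPAR0 reading from the front panel.

```python
# Sketch: virtual page number and physical address for a KT11 mapping.
def page_of(vaddr):
    return vaddr >> 13                       # eight pages of 020000 bytes

def phys(par, vaddr):
    # PAR holds the page base in 64-byte clicks; add the in-page offset.
    return (par << 6) + (vaddr & 0o17777)

print(page_of(0o10210))                      # 0
print(f"{phys(0o1614, 0o10210):o}")          # 171610
```

So the faulting PC sits in page 0 of the virtual address space, and the instruction it points to should be found at physical 0171610.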
SR3: 000000 (that's odd -- shouldn't split I/D be enabled?)
No; you're running binary for a /40 system, so no split I/D. So also, all the UDPARs and UDPDRs will contain junk.
UIPAR: 001614 001760 001614 001614 001614 001614 001614 001614
?? UIPAR7 looks wrong; if the data is really at 01760, I think the stack should be above that in real memory - but I might be wrong, I will check.

If it is wrong, did something cause the wrong value to be stored there (e.g. an error in the execution of lines 1750/1751 in Lions); or was the prototype calculated wrong (around line 1704) - but I think the prototypes looked correct in the process' core dump, but I will check them; or did the hardware flake out, and e.g. copy a later store (the code fills them from the top down) up to UISA7?

To check out the latter, maybe a bespoke tiny program, toggled in, to try storing the 'correct' data in the UISPARs, in the exact way that the Unix code does it, and then look and see what's in there?

This might also correlate to the strange stuff I saw in the process' user-mode stack, in the dump - I will go back and look at that now.

If you do this again, please add KISA6 and KISD6 to the registers to dump (you can skip UDS*), so we can see what it thinks is going on with the per-process swappable data, which should be just below the process' user-mode data, in terms of real memory.

Yes, the stack is directly above the user data, which is directly above the swappable per-process data (user struct, and kernel stack). But the address math for stack segments in the KT11 is weird (see below).

I think the prototypes:
UISA 000000 000020 000000 000000 000000 000000 000000 177701
UISD 041402 016006 000000 000000 000000 000000 000000 066016
are right, but the negative direction of the stack is making my head hurt (and the UISA7 you recorded from the hardware might be right after all - but then the UISA0 might be wrong - it's suspicious, but not impossible, that they are the same value).

If the SPPD is at physical xxx, the user data will be at xxx+20 (in clicks, as above) through xxx+20+34 (below), and then the stack above that. Per the SPPD:
tsize 000104
dsize 000035
ssize 000024
the stack should then run from xxx+20+35 to xxx+20+35+23. The way the MM hardware works for stack segment, the 'base' is where the first click would be if the segment were a full 0200 clicks. (Per the example in the /45 proc handbook; for a 3-click stack running from physical 0331500 to 0331776, the PAR would contain 03120, i.e. segment base at 0312000.)

So let me do the math (please check to see if I'm confused :-); base of user data is at 0176000 (per UISA1 contents), runs to 0201476 (i.e. plus 03500); the stack would run from 0201500 to 0204076 (i.e. plus 02400). So the stack segment 'base' would be 020000 below the next word, or 0164100.

(My head hurts too much to work out if the 177701 of the prototype is right; basically, the location of the SPPD in clicks would be 01740 (I think - 01760 - 020), and that plus 177701 should give us 01641.)
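
The stack-segment arithmetic above can be double-checked with a short Python sketch. Variable names are mine; the inputs are UISA1, dsize, ssize and the UISA7 prototype quoted in this message.

```python
# Sketch: where should UISA7 point, given UISA1 and the segment sizes?
CLICK = 0o100
SEG_BYTES = 0o20000                  # a full segment is 0200 clicks

data_base = 0o1760 * CLICK           # UISA1 contents, as a byte address
data_end = data_base + 0o35 * CLICK  # dsize clicks of user data
stack_end = data_end + 0o24 * CLICK  # ssize clicks of stack above that
stack_par = (stack_end - SEG_BYTES) // CLICK
print(f"{data_end:o} {stack_end:o} {stack_par:o}")   # 201500 204100 1641

# Same answer from the prototype: SPPD location plus 0177701, mod 2^16.
sppd = 0o1760 - 0o20
print(f"{(sppd + 0o177701) & 0o177777:o}")           # 1641
```

Both routes give 01641, so the prototype value is consistent with the hand calculation.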

But, anyway, I'm fairly sure that 01614 is not right for UISA7 (unless it really was 1641 and you inverted the digits because it looked so close).

Having KISA6 would help since it would give us a cross-check on the value of UISA1.....

So, according to the process core dump, these are the register contents at the time of the fault:
R0 177770
R1 0
R2 0
R3 0
R4 34
R5 444
SP 177760
PC 010210
PS 170010
Now, PDP-11 uses R5 for a frame pointer, set up thus:
        jsr     r5,csv        (first instruction in every C routine)

csv:
        mov     r5,r0
        mov     sp,r5
        mov     r4,-(sp)
        mov     r3,-(sp)
        mov     r2,-(sp)
        tst     -(sp)
        jmp     (r0)
on subroutine entry (the 'jsr r5, csv' pushes the old R5 contents, and temporarily saves the return PC - to just after the call to CSV, not to the subroutine which called this one, that's further down - in R5). So, except for the first two instructions of CSV, R5 always contains an old SP.

Now look at the R5 from the crash. That's not an old SP. Something has already gone seriously wrong by this point - actually, likely the process has just started to run the newly-loaded command code (see below), and hasn't even set up its first stack frame yet.

Now look at the top of the stack, as recorded in the process' core dump:
0010060: 000000 000020 000001 177770 177774 177777 071554 000000
And that's it; the rest is all 0's! (The base address does seem to correspond; with:
dsize 000035
ssize 000024
and the SPPD being 020 clicks, that puts the top of the stack at 0101 clicks, or 010100, and the last location there is 010076.)

The core dump routine, core() writes the user data out in two transfers (Lions 4113-4124), one for the SPPD, one for the user's data+stack. So we probably got the SPPD OK, but the rest - who knows?

It does call estabur(), which sets up the prototype MM register contents, and then writes them into the actual registers, so the prototypes in the process' core dump that I was looking at before have already been overwritten. :-( But estabur() then calls sureg() (Lions 1724) so hopefully the MM regs wound up pointing to the actual memory being used for the stack - but who knows?

Anyway, looking at the contents, the top of the stack does look vaguely like what it should be when the command starts executing, after the exec() call; the SP is even reasonable; it points to that 0 at offset 010060.

The 020 is the return point for the call to _main (see below; that 'jsr pc,_main' ends at 016); the '1' is probably 'nargs' (see Exec(II) in the V6 Manual), the '0177770' is argv, '177774' is argv[0], 177777 is argv[1] (end of list marker), and '071554' is 'ls' (the command name, by convention the first argument).
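
Reading ASCII out of an octal word dump is a matter of splitting each 16-bit word into its low and high bytes, since the PDP-11 stores the low byte first. A minimal Python sketch, with a hypothetical helper name:

```python
# Sketch: turn a 16-bit word from an 'od' dump into its two ASCII bytes.
import struct

def word_to_ascii(word):
    # "<H" packs a little-endian unsigned 16-bit value, low byte first.
    return struct.pack("<H", word).decode("ascii")

print(word_to_ascii(0o71554))                            # ls
print(word_to_ascii(0o67543) + word_to_ascii(0o62562))   # core
```

The second line decodes the pair 067543 062562, which appears several times near the start of the dumps.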

R0 contains what looks like an old SP, although I suppose that could have been left over from the assembler startup:
start:
          setd
          mov     sp,r0
          mov     (r0),-(sp)
          tst     (r0)+
          mov     r0,2(sp)
          jsr     pc,_main
but clearly the attempt to execute the first instruction in CSV blew up. And where did the '444' in R5 come from? The call to CSV is at 030?

February 1

Noel, regarding the second core file:

I took a quick look, and everything 'important' seems to be identical: the registers, PC, etc at the time of the trap (including that mysterious '444' in R5); the prototype MM registers; the user's stack (looking again like the command just started).

I went ahead and dumped a few of the KT11 registers:
UIPDR: 041402 016006 000000 000000 000000 000000 000000 066116
Oh, BTW, I checked, and these match the prototype values in the user struct.

February 2-3

A tip from Noel:

Something stirred this in my memory: the best quick overview of the internals of the Bell PDP-11 Unixes is K. Thompson, "UNIX Implementation", available here:

https://users.soe.ucsc.edu/~sbrandt/221/Papers/History/thompson-bstj78.pdf

if you want to know more about what the insides are like.

Fritz:

Okay, here's the latest, done with some care:

UISD: 041402 016006 000000 000000 000000 000000 000000 066116
UISA: 001614 001760 001614 001614 001614 001614 001614 001641

KISD: 077406 077406 077406 077506 077506 077406 007506 077506
KISA: 000000 000200 000400 000600 001000 001200 001740 007600

SRs: 040143 000000 010210 000000

171600: 016162 004767 000224 000414 006700 006152 006702 006144

[Note: this fixes the previous late-night transcription error with UISA7...]

Noel:

UISD: 041402 016006 000000 000000 000000 000000 000000 066116
UISA: 001614 001760 001614 001614 001614 001614 001614 001641
So, 'good news' is these are the same except for UISA7, for which as I suspected, it looks like the digits were transposed. But the new value is exactly the one I calculated.

'Bad news' is that takes out what I was thinking might be a potential cause, which was UPAR's getting trashed by hardware failure. So more hard work ahead (see below).
KISD: 077406 077406 077406 077506 077506 077406 007506 077506
KISA: 000000 000200 000400 000600 001000 001200 001740 007600
Those all look OK: KISD6 shows the segment length as 020 (017 being the last valid click), which is right, and KISA6 is 01740, so with the user area and kernel stack being 20 clicks, that makes the start of the user data 01760, which is what UISA1 contains.
SRs: 040143 000000 010210 000000
OK, same failing location as before (010210); SSR0 shows:
Abort - page length error
User mode
Page 1
which is the same as last time.
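
The SSR0 reading above can be reproduced with a small decoder. This is a sketch only: the bit positions are as I recall them from the KT11-C documentation (abort flags in bits 15-13, mode in bits 6-5, page number in bits 3-1, enable in bit 0) and are worth checking against the manual; the function name is made up.

```python
def decode_ssr0(ssr0):
    """Pick apart the SSR0 fields of interest (bit layout assumed, see above)."""
    modes = {0: "kernel", 1: "supervisor", 2: "illegal", 3: "user"}
    return {
        "abort_non_resident": bool(ssr0 & 0o100000),
        "abort_page_length":  bool(ssr0 & 0o040000),
        "abort_read_only":    bool(ssr0 & 0o020000),
        "mode":               modes[(ssr0 >> 5) & 0o3],
        "page":               (ssr0 >> 1) & 0o7,
        "mm_enabled":         bool(ssr0 & 0o1),
    }

for field, value in decode_ssr0(0o040143).items():
    print(f"{field:20} {value}")
```
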
171600: 016162 004767 000224 000414 006700 006152 006702 006144
Let me just re-check the math here: text base is 0161400, plus a PC of 010210, gives us 0171610, which is right in the middle there - thanks!

That does not, alas, look anything at all like what's supposed to be there, which is:
010200: 110024
        010400  mov r4,r0
        000167  jmp 10226 (cret)
        000016
        010500  mov r5,r0 (start of CSV)
        010605  mov sp,r5
        010446  mov r4,-(sp)
        010346  mov r3,-(sp)
So maybe the RK11 went berserk? But maybe not...

The 4767 is a 'jsr pc, xxx' which is typical C compiler emission, but the rest looks like rubbish - 6700 is a SXT R0, for instance.

What's actually there at 010210 (virtual) still doesn't explain the MM trap we got; 'SXT R0' should have executed OK, no matter what? Confoozled...

What's also odd is how it got here; it's almost like the first few instructions:
start:
          setd
          mov     sp,r0
          mov     (r0),-(sp)
          tst     (r0)+
          mov     r0,2(sp)
          jsr     pc,_main

_main:
          jsr     r5,csv
executed OK, and then it tried to go off to csv, only there's trash there? And what's with the 0444 in R5? That should be 034, the return from that last JSR.

I'm going to go ponder all this. One more thing you could try is do this all again, and write down the first couple of instructions at the start of the text segment (UISA0 = 01614, so 0161400 on for a few words), so we can see if that looks OK.

If so, it will look like the command got read in off the disk wrong - since it's not coming from swap (it's just starting), it's coming out of the file system wrong. Why will be a good question.

And I still don't understand the 'segment 1' fault, and the R5 contents - so many things going wrong all at once, for reasons that make no sense... I wonder if there's a noise glitch hitting several things all at the same time?

Fritz:

I read a bit through the KT11 maintenance manual you sent yesterday, to refresh myself on it a bit (thanks for that!). I realized I almost always use my console in "PROG PHY" or "CONS PHY" mode; but using "USER I" and "KERNEL I" I may be able to verify quickly that the KT11 is thinking VA:010210 -> PA:171610.

When I set this up to try later, I'll examine that start of the text segment at 161400 as well, per your recommend.

February 4

Noel sends up a flare on cctalk in the early AM, summarizing the problem and experiments to date. Suggestions start to flow in. Some have already been tried or can be ruled out. Some others:

Bob Smith: "I keep wondering about the psu...". This gets some agreement from the list, and a few interesting/relevant anecdotes are relayed. Paul Koning:

In RSTS development we once ran into DMC-11s not working reliably. The field service tech knew exactly what to look for, and started checking all the supply voltages. The spec says allowed tolerances are +/- 5%. He knew the reality for correct operation was -0%, +5%, so he tweaked all the supplies to read a hair above nominal.

Warner Losh:

I recall our PDP-11 tech tweaking +5V from 5.05V to 4.95V and back again to demonstrate that tiny differences matter a lot on one of the cranky 11/23+'s we had after I made a particularly unhelpful teenage smart ass remark... The 11/23+ wouldn't boot at the slightly lower than full voltage.

It is worth noting that in both of these cases, a slight undervoltage proved problematic...

Paul Koning suggests a potential KT11 failure mode:

Another possibility occurs to me: bad bits in the MMU (UISAR0 register if I remember correctly). Bad memory is likely to show up with a few bits wrong; if UISAR0 has a stuck bit so the "plain" case maps incorrectly you'd expect to come up with execution that looks nothing at all like what was intended.

Noel provides a short diagnostic (apparently, straight from his mind to machine code; props! :-) to check read-after-write on UISA* so we can rule this out:

1000:   12706       / Put stack at 0700
        700
        12701       / Load UISA0 address in R1
        177640
        5000        / Start testing at 0
        10011       / Store it
        20011       / Check it
        1401        / Skip if match
        0           / Halt here on error
        5200        / Next value
        20027       / 07777 or less?
        7777
        101770      / Go around
        5721        / Next register
        20127       / Done them all?
        177660
        101401      / Skip if not
        0           / Halt here when done
        137         / Go back
        1010

This is toggled in and passes on the machine.
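
For anyone who would rather read the logic than the octal, here is a rough model of what the diagnostic does to each register: write every value from 0 to 07777 and read it back. The register class and its "stuck low" behaviour are hypothetical, just a way to show the kind of fault Paul was describing being caught.

```python
class StuckBitRegister:
    """Hypothetical 12-bit register; bits set in stuck_low always read back as 0."""

    def __init__(self, stuck_low=0):
        self.stuck_low = stuck_low
        self.value = 0

    def write(self, v):
        self.value = v & 0o7777 & ~self.stuck_low

    def read(self):
        return self.value

def first_failure(reg):
    """Write 0..07777 and read each back; return the first mismatching value, or None."""
    for v in range(0o10000):
        reg.write(v)
        if reg.read() != v:
            return v
    return None

print(first_failure(StuckBitRegister()))                 # healthy register
print(first_failure(StuckBitRegister(stuck_low=0o100)))  # bit 6 stuck at 0
```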

Mattis Lind:

Would it be any difference if you run the machine at full speed or lower speed or even single step past this instruction? ... The TIG module has a separate non crystal controlled oscillator which one could tune for marginal checking.

Ah, yes, the margining clock! Always worth a check, and very easy to use if you have a KM11 handy. A variety of clock speeds are tried, but the behavior remains the same.

Brent Hilpert:

For consideration, what about the refresh circuitry of the memory board?

Mem diagnostics, unless they explicitly account for it, may not show up problems with memory refresh if the loop times are short enough to effectively substitute as refresh cycles, while they could show up later in real-world use with arbitrary time between accesses.

Refresh on some early boards/systems was asynchronously timed by monostables or onboard oscillators which can drift or fail on the margin/slope. (I don't know what DEC's design policy was for DRAM refresh). It might also explain why a number of 4116s were (apparently) failing earlier in the efforts (if I recall the discussion correctly), replacing them might have just replaced them with 'slightly better' chips, i.e. with a slightly longer refresh tolerance.

This one also gets some follow-up. The schematics are consulted, and the MS11-L refresh is seen, indeed, to be driven by a simple free-running 555. Further from Brent:

4116 datasheet specs 2mS, my calcs give a refresh period of 1.5mS, the 14.5uS from the manual would give 1.86 mS, 7% shy of 2. The schematic specs 1% resistors, and the parts list does appear to spec a high-tolerance "1%200PPM" cap.

Although there are the internal voltage divider Rs in the 555 which are also critical for the timing and everything is 40+ years old...

The actual MS11 in use measures out on my 'scope at 15.2us. From Brent:

15.2uS gives a 1.95mS refresh, so it's awfully close to the 2mS spec, but still within. The datasheet I was looking at doesn't seem to give any spec for tolerance on the refresh so one would guess there's a safety margin built into the 2mS spec.
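
Brent's figures are easy to reproduce. The sketch below assumes a full refresh takes 128 row cycles (that number is my assumption; it is what makes 14.5 uS and 15.2 uS come out at the quoted 1.86 mS and 1.95 mS).

```python
ROWS = 128  # assumed number of refresh rows; consistent with the figures quoted above

def refresh_period_ms(interval_us, rows=ROWS):
    """Time to cycle through every row once, given the per-row interval in microseconds."""
    return interval_us * rows / 1000.0

for interval in (14.5, 15.2):
    period = refresh_period_ms(interval)
    print(f"{interval} us per row -> {period:.2f} ms (2 ms spec met: {period < 2.0})")
```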

Fritz:

R0 177770
R1 0
R2 0
R3 0
R4 34
R5 444
SP 177760
PC 010210

060: 000000 000020 000001 177770 177774 177777 071554 000000
Okay, I've had a bit of time in front of the machine to repro this and take a look. What I actually see is:
R0 177770
R1 0
R2 0
R3 0
R4 0
R5 34
R6 141774
PC 000254
(remember, for the last, this will have been after taking a trap to 250, where I have the usual "BR .+2; HALT" catcher installed)

Also, memory at 060 (PA:164060) is all zeros as far as the eye can see...

Then, a big discovery from Noel:

Argh. (Very red face!)

I worked out the trap stack layout by looking at m40.s and trap.c, and totally forgot about the return PC (that's the 0444) from the call to trap():
0001740 000013 141756 022050 000013 000000 000000 000000 000034
0001760 000444 000031 177760 000000 030351 177770 010210 170010
I clearly should have looked at core(V) in the V6 manual!

The R6 you have recorded is correct for just after the trap; that's the kernel mode SP, which points to the top of the kernel stack, in segment 6 (in the swappable per-process kernel area, which runs from 140000-1776).

So there is no R5 mystery, I was just confused. Back to the other two!

But meanwhile, back in front of the actual machine:

Seeing some quite strange stuff now, after the crash, flipping between "CONS PHY" and "PROG PHY"...

Bits 6-12 are not acting as I would expect, almost as if the KT11 ALU is doing an incorrect operation (subtraction rather than add!)

I see these are 74S181 bit slice ALUs, and function code should be hardwired to "A+B"... So that brings us back around to really checking those supply voltages...

It turns out the +5V supplies were, in fact, slightly low (about 4.9 or so). Trimmed these up, and the observed problems with bits 6-12 receded, though the "ls" crash remained exactly the same. It would appear, though, consistent with remarks above, that the machine has very little undervoltage tolerance on +5V -- certainly less than the documented -5%.

How long had the machine been in this condition, and what else might have been affected? It could not have been for very long, since the previously run KT11 diagnostics would certainly have failed. But the situation was spooky, and instilled some uncertainty about other data that had recently been retrieved via the front panel...

February 5

Noel clears away one additional address calculation error:

So I had to grub a bit to find this, but here's what I said:
With KISA7 at 001641, 0164100 should be the first location after the stack, so 0164060 and up would be good. They should be:
060: 000000 000020 000001 177770 177774 177777 071554 000000
and I have no idea how I screwed the address there up that badly. The data I'm showing there is the top (address-wise; i.e. bottom, push-pop-wise) of the user stack, and I think it's correct. However, it's UISA7 which contains 01641, and that's the 'bottom' of that segment. I had previously done the math correctly:

base of user data is at 0176000 (per UISA1 contents), runs to 0201476 (i.e. plus 03500); the stack would run from 0201500 to 0204076 (i.e. plus 02400). So the stack segment 'base' would be 020000 below the next word, or 0164100.

So physical 0164060 is just in the middle of nowhere; it's somewhere in the middle of the text (which starts at physical 0161400).

If you could try this again, and check the top of the actual user stack (which will be at physical 0204060-0204076), I'd really appreciate it. I do expect it to be correct: the process core dump has it correct (as shown by the analysis of argc, argv, etc).
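
Noel's layout arithmetic can be checked mechanically. A minimal sketch, assuming (as in the quote) that the stack sits physically right after the data and that segment 7 is mapped so its last word is the last word of the stack; the function name is hypothetical.

```python
CLICK = 0o100  # 64-byte memory management click

def user_stack_words(uisa1, dsize, ssize, nwords=8):
    """Return (segment 7 physical base, first and last physical address of the
    top nwords words of the user stack). Sizes are in clicks."""
    data_base = uisa1 * CLICK
    stack_end = data_base + (dsize + ssize) * CLICK  # first byte past the stack
    seg7_base = stack_end - 0o20000                  # an 8 KB page ending at stack_end
    return seg7_base, stack_end - 2 * nwords, stack_end - 2

seg7_base, first, last = user_stack_words(0o1760, 0o35, 0o24)
print(f"{seg7_base:07o} {first:07o} {last:07o}")
```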

And I am able to get some consistent, correct, data after the power-supply tune-up:

Okay, latest numbers for you!

Stack, confirmed:

PA:204060: 000000 000020 000001 177770 177774 177777 071554 000000

Text; as I had feared, a few dropped bits there! Went ahead and grabbed you eight extra words while I was there:

PA:171600: 016162 004767 000224 000414 016700 016152 016702 016144
PA:171620: 004767 000206 000405 012404 012467 016124 000167 177346

In disassembly from 171602, this yields:

171602:   JSR     PC,172032
171606:   BR      171640
171610:   MOV     7766,R0
171614:   MOV     7764,R2
171620:   JSR     PC,172032
171624:   BR      171640
171626:   MOV     (R4)+,R4
171630:   MOV     (R4)+,7760
171634:   JMP     171206

...which looks at least like feasible code, if not the code we are expecting?

Last, a note on procedure for using the front panel to verify KT11 address mappings:

The way this works is you select the mapping set you want (in our case, USER I) with the top knob on the console, then toggle in a virtual address, hit "LOAD ADRS", and then when you hit "EXAM" it maps your provided address through the selected set. Under these circumstances, I'll also see the "KERNEL" light go out and the "USER" light light up on the front panel indicating the active mapping set. You can then flip to "PROG PHY" to see the mapped-to physical address. This is not explained very clearly in the handbooks; it took me a little experimentation to figure out how to do it.

Anyway, in our case, I toggle in "10210", and can read out "171610".
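
The translation the front panel is doing here can be sketched as follows. This ignores the page descriptor checks entirely (length, access control) and just models the address sum, using the UISA values recorded earlier; the function name is made up.

```python
def kt11_map(va, pars):
    """Map a 16-bit virtual address to an 18-bit physical address.

    The top three bits of the VA select one of eight page address registers;
    the register value is in 64-byte units and the low 13 bits are the offset.
    """
    page = (va >> 13) & 0o7
    offset = va & 0o17777
    return (pars[page] << 6) + offset

UISA = [0o1614, 0o1760, 0o1614, 0o1614, 0o1614, 0o1614, 0o1614, 0o1641]

print(f"{kt11_map(0o010210, UISA):o}")  # the faulting PC
print(f"{kt11_map(0o177760, UISA):o}")  # the user SP at the time of the trap
```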

February 6

Noel:

In disassembly from 171602, this yields: ...which looks at least like feasible code

The first 4 words, yes, but not the rest. (Oh, and your disassembly is wrong; you used PA addresses, not VA.)

But excitingly, that could explain the MM trap, since 16700/16152 at VA: 010210 gives us:
MOV 26366, R0
and that address is in segment 1, which is only 03500 long...
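
The PA-versus-VA difference in the disassembly, and why the VA version faults, can be sketched like this. It assumes the index word directly follows the opcode (true for the source operand here), so the PC has advanced by 4 when the index is added; the helper names are hypothetical, and the length check is simplified to a byte count for an upward-growing page.

```python
def pc_relative_target(instr_addr, index_word):
    """Operand address for index-on-PC addressing: PC after the index word, plus the index."""
    return (instr_addr + 4 + index_word) & 0o177777

def within_page(va, length_bytes):
    """True if va's offset within its 8 KB page is below the page's length."""
    return (va & 0o17777) < length_bytes

# Same instruction words (016700 016152), disassembled at the PA and at the VA.
print(f"{pc_relative_target(0o171610, 0o016152):06o}")
target = pc_relative_target(0o010210, 0o016152)
print(f"{target:06o} page {target >> 13} in range: {within_page(target, 0o3500)}")
```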

Fritz:

Also, that exact sequence does occur in the ls binary!

From last night:

PA:171600: 016162 004767 000224 000414 016700 016152 016702 016144
PA:171620: 004767 000206 000405 012404 012467 016124 000167 177346

And from an od on bin/ls:

0004220 016162 004767 000224 000414 016700 016152 016702 016144
0004240 004767 000206 000405 012404 012467 016124 000167 177346
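
Comparing dumps like these by eye is error-prone (as the earlier transcription slips show), so here is a small sketch that parses both formats and compares the words. The parser is an assumption about these two line formats only, not a general od reader.

```python
def parse_dump(lines):
    """Parse 'PA:ADDR: w w ...' or od-style 'ADDR w w ...' lines into a list of words."""
    words = []
    for line in lines:
        fields = line.replace("PA:", "").replace(":", " ").split()
        words.extend(int(f, 8) for f in fields[1:])  # skip the address column
    return words

memory = parse_dump([
    "PA:171600: 016162 004767 000224 000414 016700 016152 016702 016144",
    "PA:171620: 004767 000206 000405 012404 012467 016124 000167 177346",
])
binary = parse_dump([
    "0004220 016162 004767 000224 000414 016700 016152 016702 016144",
    "0004240 004767 000206 000405 012404 012467 016124 000167 177346",
])
print(memory == binary)
```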

All together, this brings us to a significant juncture in the debug effort: the power supply issue has been addressed, and various red herrings have been cleared away. Pre-conditions which exactly match the observed fault are apparent. We are left with a single, consistent, and reproducible issue: part of the process address space ends up holding the wrong part of the program text. But how, and why?

PDP-11/45: ECOs

Sun 07 June 2020 by Fritz Mueller

[A catch-up article, documenting discoveries of Jan/Feb 2019 and some additional research while writing up.]

Okay, per last article, specifics of parity-handling on my CPU indicate that it is missing several significant ECOs. I spent some time seeing if I could dig up and/or puzzle out more details on ECOs issued for the CPU, MMU, FPU, chassis, power supplies, and peripherals. What I could find I will summarize here.

ECO Designations

This material is excerpted from PDP-8E_ECO_LOG_Dec74.pdf on bitsavers. Of particular note is the distinction between "ECOs", implemented at the factory, and "FCOs", designed to be implemented in the field. The latter are distinguishable by the presence of an additional letter inserted in their identifier:

ECO IDENTIFIER

Every ECO that is issued for a product is entered in the DEC-O-LOG with an identification number that includes the equipment type or option number, followed by a unique sequential number. For example, the ECO identifier “LA30-00053” is assigned to the 53rd ECO issued for the LA30 DECwriter.

FIELD DISTRIBUTION

Depending upon the nature of the engineering change, ECOs are categorized in one of five groups, which are assigned letter codes F, D, DF, P, and M. The appropriate field distribution code for the ECO appears on the first line of the DEC-O-LOG entry.
F Code: When the ECO has been engineered for field retrofit, it is coded “F”. A Field Change Order (FCO) is prepared by Field Service. The FCO provides a complete description of the engineering change and includes instructions for making the change to existing equipment installations.

When an F-coded ECO is issued, a letter code is inserted as the first character of the sequential ECO identifier number. The letter code indicates the level of urgency or importance of the ECO as follows:
A   Mandatory change - highest priority
B   Mandatory change
C   Mandatory change if specified trouble symptoms are observed
D   Low Priority change
E   Optional change for product improvement
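
To make the identifier scheme concrete, here is a minimal sketch of a parser for identifiers like the ones that appear later in this article. The regular expression and field names are my own assumptions; in particular, treating a trailing letter (as in M8106-C0012A) as a supplement to an earlier entry is my reading, not something the excerpt states.

```python
import re

PRIORITY = {
    "A": "Mandatory change - highest priority",
    "B": "Mandatory change",
    "C": "Mandatory change if specified trouble symptoms are observed",
    "D": "Low Priority change",
    "E": "Optional change for product improvement",
}

# option, "-", optional priority letter (F-coded ECOs only), digits, optional suffix
ECO_RE = re.compile(r"^(?P<option>.+)-(?P<priority>[A-E])?(?P<seq>\d+)(?P<suffix>[A-Z])?$")

def parse_eco(identifier):
    m = ECO_RE.match(identifier)
    if m is None:
        raise ValueError(f"unrecognized ECO identifier: {identifier!r}")
    letter = m.group("priority")
    return {
        "option": m.group("option"),
        "sequence": int(m.group("seq")),
        "supplement": m.group("suffix"),
        "field_change": letter is not None,
        "priority": PRIORITY.get(letter),
    }

for ident in ("LA30-00053", "M8105-D0005", "M8106-C0012A"):
    print(ident, parse_eco(ident))
```
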
NOTE

All ECOs issued for a product are entered into the DEC-O-LOG. Usually, the publication process that produces updated DEC-O-LOG pages is triggered by an F-coded ECO. At that time, any D, DF, P, or M-coded ECOs that precede the latest FCO are also published to provide a complete history of all ECO activity for the product.

D Code: An ECO that changes the design of an equipment in production is coded “D”. This category includes changes to:

Production models, as a normal step in the design phase of a new product.

All future production units of a specific equipment type or option, and the ECO is not to be retrofitted.

Production units that include special features ordered by a specific customer.

ECOs that are coded “D” are not retrofitted into existing units installed in the field. They are entered in the DEC-O-LOG for information purposes, and customers can purchase them from DEC.

DF Code: The “DF” field distribution code is assigned to ECOs that change the design of products that have been released to the field in limited quantity, when the retrofit is essentially a part of the design process.

P Code: When an ECO is issued solely for the purpose of changing engineering drawings or other engineering documentation, it is coded P (for Prints).

M Code: When an ECO is issued to change mechanical parts, structural components, or other items that do not affect existing units in the field, it is coded “M”. M-coded ECOs cannot be purchased for retrofit.

ML, DD, and WL REVISIONS

A Master Drawing List (ML) or a Drawing Directory (DD) lists all engineering drawings included in the engineering drawing set for a specific product or option. If an ECO causes any drawing in that set to be revised, the ML or DD is revised to indicate the latest revisions of all drawings in the set. The revision code for the ML or DD that reflects the ECO is included in the first line of that ECO entry in the DEC-O-LOG.

A Wire List (WL) is used to indicate all wired connections for an equipment or option System Unit. If an ECO calls for changes in this wiring, the WL is appropriately revised. The revision code for the WL that reflects the ECO is included in the first line of that ECO entry in the DEC-O-LOG.

DEC-O-LOGs on Bitsavers

Unfortunately, at time of writing the selection of DEC-O-LOGs archived at bitsavers (here) is pretty sparse with respect to the '11/45. There is a 1974 log for the M8105 TMC CPU board, a 1973 log for the M8108 SSR MMU board, and that's pretty much it for the '11/45 :-(

Here's an entry from the M8105 log that is found there, relating to the subject of the previous article:

FCO
M8105-D0005

Unfortunately, no cut/jump list here (some log entries do contain these). There is a bunch of useful info to be gleaned, though:

The complete trap parity to 114 mod involved all of the M8100 (DAP), M8103 (RAC), M8105 (TMC), M8106 (UBC) boards, plus mods to the processor wire-wrap backplane.
All of these FCOs are given prefix "D", meaning DEC viewed fixing these in the field as "low priority".
A reason for the change is given: "Parity errors are time consuming to detect..." I suppose what they mean is that adding code to sniff for possible parity errors, in order to distinguish these from other possible sources of trap 4, was getting in the way of handling the other trap 4 sources promptly. I can't see that there would be any difference in processor overhead per se in dispatching between vector 4 and vector 114?
Machines which would trap parity to 114 as-delivered were phased into production starting in March 1973. This is consistent with the low serial number and component date codes in my machine, confirming its manufacture somewhere in 1972.
The note about the MS11-B/C (M8110 SMC) is also interesting, in conjunction with fact that the re-work is called out for "all core-parity PDP-11/45's".

PDP-11/45 11/50 MOS Memory Troubleshooting Guide

As it turns out, this document is a bit of a gold mine, containing a complete list of ECO's for the '11/45 CPU/MMU/FPU modules, console, and backplane, as of 1974. There are brief notes, dates, and revision codes for each.

In the case of the M8105, we can compare this with the DEC-O-LOG we do have on bitsavers. This shows that the "Notes" in the troubleshooting guide in some cases give information not included in the corresponding DEC-O-LOG entries. For example, the DEC-O-LOG entry for M8105-00002 says only:

Change disable gate in priority arbitration.

...while the troubleshooting guide offers a bit more explanation:

Corrects FP TRAPS to location 4 where BRs below 7 occur at the same time.

The troubleshooting guide also calls out several "clusters" of interdependent ECOs:

Parity, as remarked above: M8100-00003, M8103-00005, M8105-00005, M8106-00007/00008/00012/00012A, KB11-A-00015, and M8110-00018.

Separately, M8106-00003, from August 1972, is annotated "Adds disable jumper for UNIBUS PARITY", as observed missing in my hardware in the previous article.
CMP.B, BIT.B: M8100-00002A, KB11-A-00006. These are dated June, 1972. Hmm, I should check these instructions on my machine...
Speed-up: M8104-00002, M8106-00005A, KB11-A-00012A/00013. Not sure what is "sped up" here, but perhaps Unibus signaling; the M8106 change is annotated "Clear MSYN with T1".

Also of note are a few items that seem to relate directly to previously observed behaviors of my machine:

KM11 uPB: M8109-00015 "CPU fails to halt on selected ROM state when micro program break mode is set on maintenance card. Detected when running a test sequence not containing a pause, i.e., 150ns cycle time." Ah ha! I had run across this (mis)behavior previously, noted here.
Slots 26-28: KB11-A-0008 "Signal missing in small peripheral controller slots in KB11-A (slots 26-28)." Another ah ha... This seems likely related to issues noted here.
Burnt +5V Trace: KB11-00001: "Reworks backpanels with 24 AWG wire in parallel with +5 V etch." This may explain the burnt trace and "extra red wire" as noted here.

Jay Jaeger DEC-O-LOG Microfiche Transcriptions

After an ask on the cctalk mailing list, Jay Jaeger commented that he has some complete sets of DEC-O-LOGs on microfiche! Though he has no equipment with which to scan fiche, he very kindly took the time to manually transcribe those that seemed like they might be of most interest to me (thanks, Jay!) Jay has made his transcriptions publicly available here, under pdf/dec/fieldService/dec-o-log.

There is a lot there, and it's worth a read. Here are some excerpts I found relevant to discussions and investigations here:

On the parity rework:

M8100-C0003 CODE: F CS C Etch C JAN-73 [FCO]

Problem: Parity errors require the generation of trap vector 114
Correction: Modify trap vector logic
Required on all 11/45 systems with parity

M8103-C0005 CODE: F CS F MAR-73 [FCO]

Problem: Parity errors need special abort logic to assert ZAP signal
Correction: Modify ZAP gate on RACA

M8105-D0005 CODE: F CS F MAR-73 [FCO]

Parity errors are time consuming to detect through vector 4. Core parity requires these changes: M8100-D0003, M8103-D0005, M8105-D0005, M8106-D0007, KB11A-D0015 and M8106-D0008

FCO's may be installed separately - each FCO relies upon the others only to fully implement parity.

NOTE: MS11-B/C semiconductor parity memories with etch revision B M8110s will NOT function properly with these FCOs. A new M8110 is being designed to function with these FCOs.

M8106-D0007 CODE: F CS: J [FCO]

Problem: Parity errors are time consuming to detect through vector 4
Correction: Allow processor to trap to 114 for parity errors.

M8106-D0008 CODE: F CS: K MAR-73 [FCO]

Problem: UNIBUS parity errors cause machine to halt
Correction: Disable UBCB UNI PERF [ 1 ] L from generating UBCB PARITY ERR SET L.

M8106-C0012 CODE: F CS: M JUN-73 [FCO]

Problem: MOS Parity memory is too tight
Correction: Strobe parity errors sooner
NOTE: See M8106-C0012A

Rework in parity systems with M7259 or etch revision C M8110s, rework all systems with parity and all PDP11-45s at next PM.

M8106-C0012A CODE: F JUN-73 [FCO]

The rework procedure in M8106-C0012 in steps #25 and #26 incorrectly references an R22. Should be R20.

KB11A-D0015 CODE: F DD:M WL:L [FCO] MAR-73

Problem: Detection of parity errors through vector 4 is slow.
Correction: Detect parity errors through vector 114.

Wire Adds: D11D2 to A06P1, E12A1 to F11F1, C09K1 to E12A1

NOTE: This FCO must be installed in conjunction with the following FCOs to implement core parity: M8100-D0003, M8103-D0005, M8106-D0007 and M8106-D0008. Each of these related FCOs may be installed separately.

Note that MS11-B/C semiconductor parity memories with etch revision B M8110s will NOT function properly when above FCOs are installed. A new M8110 is being designed to function with these FCOs.

Okay, much of this confirms deductions worked out in the previous article. The big additional clue here is that we actually have the backplane wire adds this time!
- D11D2 to A06P1: This forwards UBCB PARITY ERR L, already at the TMC card, on to DAPE E7. This provides parity condition input to the extended trap vector generation logic explored in the previous article.
- E12A1 to F11F1 and C09K1 to E12A1: These relate to enhanced abort processing -- they distribute UBCB PE ABORT L to TMCC E87, TMCE E38, and RACA E52, which I have not previously inspected.

Looking at the RAC changes first, we have this:

...which, per expectation, is not implemented on the RAC board I am running; no connection from CK1 to E52, and no connection from E52 to E101. The ZAP signal is used to force the microcode counter to 200 on a trap or on power up, kicking off the exception handling microcode flow on sheet FLOWS 12. I'm not sure why parity required special plumbing here; by my read a parity error should end up setting TMCC ABORT H which should also result in a ZAP? My guess would be the special plumbing allows the trap to be taken at an earlier clock phase?

On the TMC we have these:

Also related to abort signaling, and also not implemented on my machine (no connection from FF1 to E87 or E38, pins 12 and 10 common on E38). So perhaps the reason to move parity to its own trap vector wasn't to avoid extra discrimination logic in the trap 4 handler, but rather to enable an accelerated abort path without affecting established behavior of the other machine exceptions on vector 4? Curious...

On CMP.B, BIT.B:

M8100-A002A CODE: DF JUN-72 [FCO]

Problem: Disposition code on M8100-A0002 is 02: phase-in
Correction: Change to code 03: rework immediately.
Problem: CMP.B and BIT.B instructions with SM0 and DM0 and DF7 will destroy the PC
Correction: Inhibit PCB clock under those conditions

The troubleshooting guide mentions that this change also needs KB11-A-00006, but that is not included in Jay's transcript (bummer).

In any case, the description here tells us a lot more about the exact nature of the bug. SM0, DM0, DF7 here are microcode conditions; collectively this description implicates a CMP or BIT instruction, either byte or word length, with two registers as arguments, and the second argument being the PC (R7).

CMP or BIT should not modify their second argument, but inspection of the microcode flows shows that under this particular set of conditions the flow (FET.00, FET.10, IRD.00, EXC.90) is shared with many other E/class instructions which should do. So an update inhibit derived from existing microcode outputs to distinguish this particular set of conditions is needed, and thus this FCO.

Here again the schematics hold some clues. The relevant bit is the net for CLKPCB H on drawing DAPJ:

The AND terms of E42 serve as inhibits for the update signal, which otherwise sources from TIGC T1 H via E43. The top three terms in the diagram inhibit on cases of UPCB=2 * ~SF7, UPCB=0, or UPCB=3 * ~DF7 (with some term elimination on the first case). This would be the vanilla implementation of the UPCB microcode signals.

The bottom-most AND term seems to be the one involved in the FCO. This adds an additional inhibit, UPCB=3 * ICLASS * UPWE, which would be active in the situation described in the FCO description. Some minor additional clues here are that the handwriting for the signal names here doesn't quite match the rest of the drawing and that the backplane pin assignment for bringing in UPWE00 is not contiguous with the others in this net; together these suggest revision. From this we could also guess that the missing corresponding KB11-A-00006 probably involves bringing signal RACB UPWE00 H to pin E06V2.
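
Written out as a boolean expression, my reading of the E42 inhibit terms looks like the sketch below. Everything here is an interpretation of the drawing as described above: the signal names are taken from the text, the polarities are assumed, and the function and its fco_installed switch are made up for illustration.

```python
def clk_pcb_inhibited(upcb, sf7, df7, iclass, upwe, fco_installed=True):
    """True if the PCB clock is suppressed for this combination of microcode signals."""
    original_terms = (
        (upcb == 2 and not sf7)     # UPCB=2 * ~SF7
        or upcb == 0                # UPCB=0
        or (upcb == 3 and not df7)  # UPCB=3 * ~DF7
    )
    fco_term = fco_installed and upcb == 3 and iclass and upwe  # the presumed FCO addition
    return bool(original_terms or fco_term)

# The case from the FCO description: destination is R7, with ICLASS and UPWE both asserted.
for installed in (False, True):
    print(installed, clk_pcb_inhibited(3, sf7=False, df7=True, iclass=True, upwe=True,
                                       fco_installed=installed))
```

Under these assumptions the clock is only suppressed in that case once the extra term is present, which matches the symptom described in the FCO.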

Inspection of my DAP modules shows the predicted changes implemented around E42 with green-wires, so it looks like my machine did get this FCO, either in the factory or the field:

Next time I have the machine fired up I'll experiment with some BIT instructions from the front panel and make sure.

On the "speedup" fixes:

M8106-A0005 CODE: F CS: F DEC-72 [FCO]

Problem: Present board not meeting cycle time specifications for UNIBUS.
Correction: Add logic changes.
NOTE: This FCO must be installed in conjunction with FCO KB11A-A0012

M8106-A005A CODE: F JAN-73 [FCO]

Problem 1: Rework procedure in M8106-A0005 is incorrect.
Correction 1:
In step #2 changed E83-11 to E83-13
In step #4 change R23 to R22
Problem 2: The assembly hole drawing does not specify which side of the board etch cut #13 is on
Correction 2: Side 2

KB11A-A0012 CODE: F DD: J WL: J DEC-72 [FCO]

Problem: 11/45 processor does not meet UNIBUS cycle time specification.
Correction: Correct KB11-A back panel attached ADD/DELETE sheet and install FCO M8106-A0005 to UBC module.

KB11A-A012A CODE: F DEC-72 [FCO]

Problem: Not all the add/deletes listed on sheet 2 of KB11A-A0012 are necessary for this retrofit.
Correction: use the ADD/DELETE sheet included in this supplement for FCO KB11A-A0012.

KB11A-E0013 CODE: F DD: K WL: K JAN-73 [FCO]

Problem: Improve performance of PDP-11/45 processor as UNIBUS master.
Correction: Revise KB11-A Wire List as defined the ADD/DELETE sheet and install etch revision C M8104 module

NOTE: This FCO completes the total FCO package necessary for improving the speed performance of the PDP-11/45 with core memory. Note prerequisite FCOs are KB11A-A0012, KB11A-A012A and M8106-A0005.

Well, definitely Unibus timing related then. There is not very much info to start from here, and logs for the M8104 module are also missing from Jay's transcripts (perhaps none were issued, since there were no F-coded ECOs for this board?)

We do see from above, though:
- Involvement of M8106 E83 (sheet UBCB, driving CLR DESKEW L). Handwriting and "white out" artifacts around this gate on the engineering drawings also suggest rework.
- Involvement of M8106 R22 (also sheet UBCB, pull up on fastbus parity error handling)
- The troubleshooting guide annotates M8106 00005 "CLR MSYN with T1". Handwriting mismatch on the clock input of the MSYN flip-flop (sheet UBCB, E79) may be related.
- At least 13 etch cuts existed in the FCO (!)
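
The prerequisite relationships stated in the speed-up FCO entries quoted above can be sketched as a small dependency graph. This is only my reading of the notes, not a DEC procedure: the edges for the two supplements (KB11A-A012A and M8106-A005A) are an assumption that a supplement follows its base FCO, and the mutual "in conjunction" relationship between M8106-A0005 and KB11A-A0012 is deliberately left out because it would form a cycle.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key maps to the FCOs assumed to be in place before it.
# These edges are an interpretation of the quoted notes, not a DEC procedure.
prerequisites = {
    "KB11A-E0013": {"KB11A-A0012", "KB11A-A012A", "M8106-A0005"},
    "KB11A-A012A": {"KB11A-A0012"},  # supplement to KB11A-A0012 (assumed edge)
    "M8106-A005A": {"M8106-A0005"},  # correction to M8106-A0005 (assumed edge)
}

# static_order() yields every node after all of its prerequisites.
order = list(TopologicalSorter(prerequisites).static_order())
for fco in order:
    print(fco)
```

Any ordering this prints places KB11A-E0013 after its three stated prerequisites, which matches the note that it completes the package.
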
Looking through the drawings a bit more after this, I noticed for the first time (doh!) that there is an updated set of Unibus timing diagrams, explicitly for use with M8104 rev C and post-KB11-A-E00013 machines. The differences seem to be around the bus long pause cycle. This prompted another look through the 1972 and 1976 KB11-A maintenance manuals, and sure enough, in section 8.7.2 of the 1976 version we find:

ECO KB11-A No. 13 ("Speed-up ECO"), in conjunction with Revision C or higher of the PDR Module (M8104), has changed the data transfer operations. Explanations of both versions are presented in this paragraph. (In general, ECO KB11-A No. 13 eliminated the bus long pause cycle.)

...followed by much description of the signalling differences between pre- and post-KB11-A-E00013 machines. This can definitely be mined for further clues.

I did verify that at least some of this rework is not implemented on my M8106 boards (specifically, the configuration of E83 on my boards does not match either of the apparent revisions in the drawings). Definitely a lot more work to do puzzling this one out. I'll continue to investigate the details of this change, but it will probably prove too much to undertake without complete documentation.

On KM11 uPB Break:

M8109-C0015 CODE: F CS: R CS: M3 APR-74 [FCO]

Problem: CPU fails to halt on selected ROM state when Micro Program break mode is set on maintenance card. The problem is detected when running a test sequence not containing a pause 150 nsec cycle time.
Correction: Delete U/L: CTRL latch flipflop and add latch at input gating. Test by running a branch dot with ROM match at 343.

FOR ETCH REVISION C:

Cuts:
  Side 2 at E12 pin 9
  Side 2 E32 pin 9 to E31 pin 13
  Side 2 E32 pin 9 to E32 pin 10
  Side 2 at E42 pin 13
  Remove jumper E22 pin 14 to E35 pin 7

Jumpers:
  E12 pin 9 to E22 pin 11
  E42 pin 13 to E35 pin 7
  E42 pin 8 to E32 pin 9
  E32 pin 10 to E31 pin 13

(etch revision F instructions elided; see Jay's transcript linked above)

NOTE: Etch revision F boards are reworked from CS revision P to R. Etch revision C boards are reworked from CS revision M2 to M3.

My M8109 timing generator is etch C. There are a lot of ECOs on this board, culminating in CS level M2 ahead of this FCO. In principle, many/most of these are verifiable by inspection, based on information contained in Jay's transcripts.
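
To prepare for that kind of inspection, the etch revision C cut and jumper list quoted above can be boiled down to a checklist of the distinct IC pins worth looking at. This is a minimal sketch: the data layout and the helper name `pins_to_inspect` are my own illustration, and the entries are copied from the quoted list without any knowledge of the actual board layout.

```python
# Etch revision C rework from the FCO quoted above, as (chip, pin) endpoints.
# Entries with a single endpoint are cuts made "at" that pin.
cuts = [
    [("E12", 9)],
    [("E32", 9), ("E31", 13)],
    [("E32", 9), ("E32", 10)],
    [("E42", 13)],
    [("E22", 14), ("E35", 7)],  # listed as an existing jumper to remove
]
jumpers = [
    [("E12", 9), ("E22", 11)],
    [("E42", 13), ("E35", 7)],
    [("E42", 8), ("E32", 9)],
    [("E32", 10), ("E31", 13)],
]


def pins_to_inspect(*groups):
    """Collect every distinct (chip, pin) endpoint, sorted by chip number, then pin."""
    pins = {pin for group in groups for entry in group for pin in entry}
    return sorted(pins, key=lambda p: (int(p[0][1:]), p[1]))


checklist = pins_to_inspect(cuts, jumpers)
for chip, pin in checklist:
    print(f"{chip} pin {pin}")
```

The cuts and jumpers between them touch nine distinct pins, so a visual check of the board can be organized chip by chip.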

This one is a bit deep, and I haven't really covered much about the M8109 yet. I'll probably take up analysis of this problem, the associated FCO fix, and inspection/determination of the revision level of my board as its own topic in a future article. I'd like to implement this fix if it proves not overly complicated.

On slots 26-28:

KB11A-B0008 CODE: DF DD: E WL: E SEP-72 [FCO]

Problem: Small peripheral controller slots in KB11-A panel, slots 26 thru 28, are not wired to accept some quad module controllers: NPG, PA, PB, LTC, ACLO, DCLO and +15V are missing.
Correction: Revise wiring to include signals listed above.

Wow, lots missing from these slots besides the +15V and NPG I had noticed. In other correspondence, Jay mentions that the DL11 (with which I was having trouble in these slots) is explicitly mentioned elsewhere in this FCO.

On the burnt-trace / extra wire situation:

KB11-00001 CODE: D May-72 [ECO]

Problem: Etch carrying +5V current from Mate-n-Lock pins to backpanel pins is not heavy enough to carry required current.
Correction: Run 24AWG wire in parallel with etch on panels which already have Mate-n-Lock assembly installed. Increase thickness of conductor with solder bead if Mate-n-Lock assembly not installed. PDP-11/45 system serial number 101 and later.

Sounds about right, though the excerpt does not mention the exact trace or connector.

There are a good handful of others described in the DEC-O-LOG transcriptions as well. Probably worth making a chronological (rather than topical) pass through each of the logs and each of my boards. For another time...

Things still sought

Somebody to scan Jay's DEC-O-LOG fiche, so the complete contents can be made available on bitsavers. Jay kindly offered to lend these out to have them scanned. If you have the equipment for this and would be interested in helping out, please drop a line on the cctalk mailing list!
Information from FCO "kits", including cut/jump lists, diagrams, and instructions. I've not yet seen any of these, so I'm unsure what form they take or where we might find them.
Alternate versions of the '11/45 engineering drawings. These can be quite useful for puzzling out the contents of ECOs by visually "diffing" them. So far I have only seen the June 1974 and April 1976 versions, which are currently available at bitsavers.
PDP-11/45 backplane wirelists. I have never seen one of these. The wirelist section in the commonly available PDP-11/45 engineering drawing sets actually describes the power harness, and not the backplane.