Games you lurve

Started by Unbeknownst, July 27, 2009, 07:24:46 PM

Fibre

Well, here's the ones that have really captured my attention out of the few that I have played:

  • Nethack: great game, although I've never managed to get past Orcus-town.
  • Captain Comic: finally completed it after ~10 years!
  • Deus Ex: unfortunately I've only played about 30 minutes of this, but have seen quite a bit more; I really need to get my hands on it to play through it someday.
  • Rise of Nations: interesting, relatively slow RTS
  • Wing Commander III: I don't really remember much but loved it; really need to find this one as well...

Quote from: Tapewolf on July 28, 2009, 11:44:02 AM
I wouldn't have thought that emulating a 16bpp buffer on a native 32bpp display in software would really be all that hard.  Surely you'd just need a 64k lookup table.
I'd be very surprised if a 256kB (64k*32b) lookup table would be more efficient than just doing the conversion directly on a current general-purpose CPU, and especially on a GPU which is where I would think that it would be done...

Tapewolf

#31

*****
Caution - there now follows a highly technical discussion of low-level graphics manipulation.  I suggest you don't read it.  You have been warned, so don't blame me if reading it does make your head explode.
*****


Quote from: Fibre on July 28, 2009, 09:27:51 PM
I'd be very surprised if a 256kB (64k*32b) lookup table would be more efficient than just doing the conversion directly on a current general-purpose CPU, and especially on a GPU which is where I would think that it would be done...

Maybe.  A lot depends on the formats involved.  16bpp can be coded either as RRRRRGGGGGBBBBB0, or RRRRRGGGGGGBBBBB, or sometimes with the R and B reversed.  Assuming the 5-6-5 layout (RRRRRGGGGGGBBBBB) and an RGBA target you would presumably have to do something like:

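xor edx,edx
mov dx, [16bpp source]  ; edx = 0000 RRRRRGGGGGGBBBBB

mov eax,edx
shl eax,16
and eax, 0xf8000000  ; R -> bits 31..27

mov ecx,edx
shl ecx,13
and ecx, 0x00fc0000  ; G -> bits 23..18
or eax,ecx

shl edx,11
and edx, 0x0000f800  ; B -> bits 15..11
or eax,edx

mov [framebuffer],eax  ; alpha byte left at zero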

...I think there may be a few bugs in there, but you'd have to do something like that for each pixel.  Unless I'm much mistaken, MMX and friends don't support packed 16BPP so it would have to be done manually as I have done here.  You would need a different algorithm for each combination of input and output pixel format, and if it was being done on-the-fly you'd also need logic to choose the right algorithm which either means using a function pointer (and more performance hit) or some kind of if-then-else logic which will also be slow.
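
As a rough illustration of the function-pointer option, here is a minimal C sketch. It assumes only the two source layouts mentioned above and the same RGBA target; the names src_format, from_565, from_5550 and pick_converter are made up for the example. The choice is made once per buffer rather than once per pixel, so the cost of the indirect call is spread over the whole frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the two source layouts discussed above. */
typedef enum { FMT_RGB565, FMT_RGB5550 } src_format;

typedef void (*convert_fn)(const uint16_t *src, uint32_t *dst, size_t n);

/* RRRRRGGGGGGBBBBB -> RGBA (red in the top byte, alpha left at zero) */
static void from_565(const uint16_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t v = src[i];
        dst[i] = ((v << 16) & 0xf8000000u)
               | ((v << 13) & 0x00fc0000u)
               | ((v << 11) & 0x0000f800u);
    }
}

/* RRRRRGGGGGBBBBB0 -> RGBA (the low padding bit is discarded) */
static void from_5550(const uint16_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t v = src[i];
        dst[i] = ((v << 16) & 0xf8000000u)
               | ((v << 13) & 0x00f80000u)
               | ((v << 10) & 0x0000f800u);
    }
}

/* Pick the converter once; the caller then runs it over a whole buffer. */
static convert_fn pick_converter(src_format fmt)
{
    switch (fmt) {
    case FMT_RGB565:  return from_565;
    case FMT_RGB5550: return from_5550;
    }
    assert(!"unknown source format");
    return 0;
}
```

Typical use would be something like pick_converter(FMT_RGB565)(src, dst, width * height) once per frame.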

With a lookup table you can precompute it, and so something like:

xor esi,esi
mov si, [16bpp source]
shl esi,2
add esi,lookuptable
lodsd
mov [framebuffer],eax

...which _should_ be quicker.

If we do have any assembly-heads, optimisations would be welcome.  Does coding belong in the Tower of Art?

EDIT:

Here are the C equivalents:

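// bitshifting

unsigned short in16bpp = *inbuffer++;
unsigned long px = in16bpp;  // widen first so the shifts happen on an unsigned 32-bit (or wider) value
unsigned long out32bpp = ((px << 16)&0xf8000000) | ((px << 13)&0x00fc0000) | ((px << 11)&0x0000f800);  // R | G | B
*outbuffer++ = out32bpp;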

// lookup

unsigned short in16bpp = *inbuffer++;
unsigned long out32bpp = lookup[in16bpp];
*outbuffer++ = out32bpp;
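
The lookup version needs its table filled in once at startup. A minimal sketch of that step, assuming RGB565 input and the same RGBA layout as the bitshifting version (red in the top byte, alpha left at zero); rgb565_to_rgba and build_lookup are illustrative names, not part of any existing code:

```c
#include <assert.h>
#include <stdint.h>

/* Direct conversion: RGB565 -> RGBA with red in the top byte.
   The low bits of each channel and the alpha byte stay at zero. */
static uint32_t rgb565_to_rgba(uint16_t p)
{
    uint32_t v = p;                      /* widen before shifting */
    return ((v << 16) & 0xf8000000u)     /* R: bits 15..11 -> 31..27 */
         | ((v << 13) & 0x00fc0000u)     /* G: bits 10..5  -> 23..18 */
         | ((v << 11) & 0x0000f800u);    /* B: bits 4..0   -> 15..11 */
}

/* Fill a 65536-entry table (256kB) with the result for every possible
   16-bit pixel value. */
static void build_lookup(uint32_t *table)
{
    assert(table != 0);
    for (uint32_t i = 0; i < 65536u; i++)
        table[i] = rgb565_to_rgba((uint16_t)i);
}
```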


EDIT EDIT:
To get back to your original point, the issue is that the GPU isn't doing this, or at least is doing it in a half-assed manner that gives you about 8bpp worth of colourspace instead of 16bpp, ruining the shadows.  What I'm essentially proposing is a software emulation layer to have the game running inside a 32bpp framebuffer.

J.P. Morris, Chief Engineer DMFA Radio Project * IT-HE * D-T-E


hapless

I have to admit that this is the first time I heard of dropping support for 16bpp.
Anyway, that would have to be implemented in the form of a patch to the video driver (which manufacturers aren't willing to do), or to the executable itself, wouldn't it?
Another point to keep in mind is that memory access is relatively expensive, taking "a little" more than one CPU cycle. While a 256KB table would technically fit in the L2 cache of most recent processors, it would probably be fetched from memory anyway. Computation MIGHT be faster. After all it's only a few bit ops... and that's what the CPU's best at.

//h(oping what he said makes some sense)
Chaosnet device not responding - check breaker on the Unibus

Tapewolf

Quote from: hapless on July 29, 2009, 07:40:56 AM
I have to admit that this is the first time I heard of dropping support for 16bpp.
Anyway, that would have to be implemented in the form of a patch to the video driver (which manufacturers aren't willing to do), or to the executable itself, wouldn't it?
Yes, you'd need to patch the EXE.  If memory serves, the Thief engine has DLLs for its output drivers, but I'd have to check.

Quote
Another point to keep in mind is that memory access is relatively expensive, taking "a little" more than one CPU cycle. While a 256KB table would technically fit in the L2 cache of most recent processors, it would probably be fetched from memory anyway. Computation MIGHT be faster. After all it's only a few bit ops... and that's what the CPU's best at.
Yes.  It's the sort of thing you'd need to measure in real-world tests to establish which algorithm works best, and it would probably end up different on different setups.
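
A rough harness for that kind of measurement could look like the sketch below. It assumes RGB565 in and RGBA out as before, uses the standard clock() function for timing, and all the function names are made up for the example. An optimising compiler may hoist or merge the repeated passes, so the numbers should be treated with suspicion until the generated code has been checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef void (*convert_fn)(const uint16_t *src, uint32_t *dst, size_t n);

static uint32_t lookup[65536];           /* filled by init_lookup() */

static uint32_t one_pixel(uint32_t v)    /* RGB565 -> RGBA, alpha zero */
{
    return ((v << 16) & 0xf8000000u)
         | ((v << 13) & 0x00fc0000u)
         | ((v << 11) & 0x0000f800u);
}

static void init_lookup(void)
{
    for (uint32_t i = 0; i < 65536u; i++)
        lookup[i] = one_pixel(i);
}

static void convert_direct(const uint16_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = one_pixel(src[i]);
}

static void convert_table(const uint16_t *src, uint32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = lookup[src[i]];
}

/* Run fn over the buffer `passes` times and return elapsed CPU seconds. */
static double time_convert(convert_fn fn, const uint16_t *src,
                           uint32_t *dst, size_t n, int passes)
{
    assert(fn != 0 && passes > 0);
    clock_t start = clock();
    for (int p = 0; p < passes; p++)
        fn(src, dst, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call init_lookup() once, then time_convert(convert_direct, ...) and time_convert(convert_table, ...) on the same source buffer, and compare both the timings and the output buffers.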

J.P. Morris, Chief Engineer DMFA Radio Project * IT-HE * D-T-E


Fibre

Obviously I should have posted this before, but I actually tried it and with a very naive C implementation saw about identical speeds between direct conversion and a lookup table. GCC4.3 was able to vectorize the loop with MMX, although I didn't look at the details of what it was doing and highly suspect that it could be optimized quite a bit more as GCC really doesn't seem to be that good at this type of thing.

Unfortunately I have to run off to work but when I get home this evening I'll post the details and a full reply, and maybe try my hand at a hand-optimized kernel if someone doesn't beat me to it.

Tapewolf

#35
Quote from: Fibre on July 29, 2009, 08:02:57 AM
Obviously I should have posted this before, but I actually tried it and with a very naive C implementation saw about identical speeds between direct conversion and a lookup table. GCC4.3 was able to vectorize the loop with MMX, although I didn't look at the details of what it was doing and highly suspect that it could be optimized quite a bit more as GCC really doesn't seem to be that good at this type of thing.

I stand corrected.  I will have to give this a go, it could be a fun project.  (Writing the conversion layer, that is.  Injecting it into Thief is beyond my ability.)

J.P. Morris, Chief Engineer DMFA Radio Project * IT-HE * D-T-E


TheJimTimMan

#36
Let's see...
Half Life 2
[HL2 mods] Research and Development (Imploding microwaves, anyone?)
               Action Half-Life 2: The Sauce of Death
Deus Ex (I feel guilty for still not having finished this, despite buying it about a year ago)
S.T.A.L.K.E.R. Shadow of Chernobyl (Using STALKER Complete 2009)
Unreal Tournament 2004
Unreal Tournament 3
Half Life 1 (Unfinished)
[HL1 mods] Sven Co-op
               The Specialists
Minecraft
Freespace 2
Mondo Agency
Diablo 2: Lord of Destruction
[D2 mod] Median XL
Cortex Command
[CC mods] Far too many to list, the game practically lives on them... not that that's a bad thing.
Mass Effect
Cave Story
Portal
Plants vs. Zombies (Melon-pults, FIRE!)
As well as a whole lot I've probably forgotten somewhere.

Rapid-fire edit: Eve Online and World in Conflict. Might as well toss TF2 in there as well.

Teh_Hobo

GAH How could I forget S.T.A.L.K.E.R.??!? and Clear Sky, for that matter! I'll have to try the Complete 2009 mod.
One week in air, two weeks in water, two weeks in water, eight weeks in ground.

Ryudo Lee

Quote from: Tapewolf on July 29, 2009, 08:07:50 AM
I stand corrected.  I will have to give this a go, it could be a fun project.  (Writing the conversion layer, that is.  Injecting it into Thief is beyond my ability.)

If someone could inject it into Thief (1 & 2 and System Shock 2) then that someone would be loved by many.

Thanks to Taski & Silverfoxr for the artwork!



Dannysaysnoo

Ooh, thought of another good un. Tomb Raider 3.

Kipiru

Quote from: bill on July 28, 2009, 08:26:42 PM
this almost almost makes me want to play Unreal II, it was received badly when it was released, if I remember right

Yeah, people said it didn't have intense enough gameplay or a good enough story. I bet none of them had ever actually gotten beyond the first Skaarj in the elevator, otherwise they wouldn't be saying such nonsense. The game is a classic in my book.

Noone

My Favorites are: (well, the ones that crossed my mind anyways)

Neverwinter Nights
Neverwinter Nights 2
Planescape Torment
Baldur's Gate 2
Starcraft
Master of Magic
Chess

Fibre

#42
Quote from: Tapewolf on July 29, 2009, 06:44:42 AM

*****
Caution - there now follows a highly technical discussion of low-level graphics manipulation.  I suggest you don't read it.  You have been warned, so don't blame me if reading it does make your head explode.
*****


Alright, here's what I have so far: http://72.14.179.253/cmf/imgconv-20090729a.tar.bz2

The output I'm seeing is:
c_nat   ok   2.537796
c_tw   FAIL   3.148723
c_tbl   ok   2.567760


It's a bit messy, and slightly incomplete (I need to add alignment support), but it has your conversion function (c_tw), my naive C one (c_nat), and the simple lookup table (c_tbl). Yours says "FAIL" at the moment because it checks against c_nat by default and I think I'm using a slightly different output format than you right now. I'll try to resolve that. If you run it, you'll probably want to adjust the image size in main(); right now it runs on 32768x32768 images which take a total of ~10GB.

The key in optimizing kernels like these tends to be in inter-iteration parallelism. I don't think we'll be able to get much by just optimizing conversion functions for a single pixel. As it is yours runs a bit slower than mine right now because GCC4.3 isn't vectorizing yours for some reason, even though it has two fewer shifts. At first glance, it looks like GCC is at least doing vectorization with SSE plus maybe software pipelining, so it may be hard to do better. I am going to try to write an SSE3 kernel using GCC intrinsics first, for reference and in hopes that GCC will be able to schedule and software-pipeline it decently. Otherwise, I'll try a full assembly version (AMD64/Intel64 with SSE3).

EDIT: Note that I tried this with GCC 4.2, 4.3, 4.4, and 4.5. GCC4.2 doesn't vectorize the loop, and for some reason 4.4+ have performance regressions compared to 4.3 on this code... If anyone has different compilers handy, it would be interesting to see how they do.

Quote
Unless I'm much mistaken, MMX and friends don't support packed 16BPP so it would have to be done manually as I have done here.
I don't see anything directly supporting RGB565 in MMX/SSE, but it looks quite doable by using the regular SSE instructions to run at least 4-8 iterations in parallel.

Quote
You would need a different algorithm for each combination of input and output pixel format, and if it was being done on-the-fly you'd also need logic to choose the right algorithm which either means using a function pointer (and more performance hit) or some kind of if-then-else logic which will also be slow.
I'm assuming that we have pretty much full control over the image formats...

Quote
If we do have any assembly-heads, optimisations would be welcome.  Does coding belong in the Tower of Art?
I don't know, just tell me where I should be posting. :)

Quote
EDIT EDIT:
To get back to your original point, the issue is that the GPU isn't doing this, or at least is doing it in a half-assed manner that gives you about 8bpp worth of colourspace instead of 16bpp, ruining the shadows.  What I'm essentially proposing is a software emulation layer to have the game running inside a 32bpp framebuffer.
OK, I'm not familiar with the games referenced in the original comment. I'm guessing they do all of the rendering on the CPU, then?

Tipod

You know how Call of Duty does scripted events pretty seamlessly and makes them awesome?
Yeah, Unreal 2 kind of went the other way with its scripted events. I know, I played through the entire thing.

To actually contribute, Air Zonk. Old school repping itt
"How is it that I should not worship Him who created me?"
"Indeed, I do not know why."

M

Hmm, my favorite games? They would probably be these:

Okami
Paper Mario (both original and TTYD)
Klonoa 2
Tail Concerto
Phoenix Wright
Cooking Mama
DDR
Persona 4
Okage
Harvest Moon (ALL OF THEM)
Pokemon
Majora's Mask
Ocarina of Time
Twilight Princess
Final Fantasy IX
Professor Layton and the Curious Village (Can't wait for Diabolical Box)

There are probably more, but I can't think of them. X3

Tapewolf

Quote from: Fibre on July 29, 2009, 08:13:19 PM
Alright, here's what I have so far: http://72.14.179.253/cmf/imgconv-20090729a.tar.bz2
Oo, fun.  I have a lot going at the moment but I'll try and play with it tonight.

Quote
Yours says "FAIL" at the moment because it checks against c_nat by default and I think I'm using a slightly different output format than you right now. I'll try to resolve that. If you run it, you'll probably want to adjust the image size in main(); right now it runs on 32768x32768 images which take a total of ~10GB.
Yes, I was thinking of multiple iterations of 1024x768 framebuffers.  Mine was targeted at RGBA, but now I think of it, ARGB is more likely so I'll need to modify the shifts and masking accordingly.

Quote
The key in optimizing kernels like these tends to be in inter-iteration parallelism. I don't think we'll be able to get much by just optimizing conversion functions for a single pixel. As it is yours runs a bit slower than mine right now because GCC4.3 isn't vectorizing yours for some reason, even though it has two fewer shifts. At first glance, it looks like GCC is at least doing vectorization with SSE plus maybe software pipelining, so it may be hard to do better. I am going to try to write an SSE3 kernel using GCC intrinsics first, for reference and in hopes that GCC will be able to schedule and software-pipeline it decently. Otherwise, I'll try a full assembly version (AMD64/Intel64 with SSE3).

I kind of dropped out of the assembler game after MMX, however if it is possible to do 64-bit shifts and ANDs, that would allow you to do two pixels per iteration.
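
The two-pixels-per-iteration idea can be sketched in plain C with a 64-bit integer, no MMX required. This assumes RGB565 in and RGBA out (red in the top byte of each 32-bit result); convert_pairs is an illustrative name. Each pixel is first widened into its own 32-bit half of the 64-bit word, and since the largest shift is 16, nothing spills from the low half into the high half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert n RGB565 pixels to RGBA, two at a time. n must be even. */
static void convert_pairs(const uint16_t *src, uint32_t *dst, size_t n)
{
    assert((n & 1) == 0);
    for (size_t i = 0; i < n; i += 2) {
        /* pixel i in bits 0..15, pixel i+1 in bits 32..47 */
        uint64_t v = (uint64_t)src[i] | ((uint64_t)src[i + 1] << 32);

        /* one shift and one AND per channel handles both pixels */
        uint64_t out = ((v << 16) & 0xf8000000f8000000ull)   /* R */
                     | ((v << 13) & 0x00fc000000fc0000ull)   /* G */
                     | ((v << 11) & 0x0000f8000000f800ull);  /* B */

        /* split the halves explicitly so byte order does not matter */
        dst[i]     = (uint32_t)out;
        dst[i + 1] = (uint32_t)(out >> 32);
    }
}
```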


Quote
I don't see anything directly supporting RGB565 in MMX/SSE, but it looks quite doable by using the regular SSE instructions to run at least 4-8 iterations in parallel.
I will have to look into that.

Quote
I'm assuming that we have pretty much full control over the image formats...
At the end of the day, it's most likely 565->0888.
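
Taking 565->0888 to mean RGB565 in and 32-bit XRGB out with the top byte unused, the shifts and masks would come out roughly as below. The bit-replication step at the end is an optional extra that was not part of the discussion above: it copies the top bits of each channel into the empty low bits so that full white maps to 0x00ffffff rather than 0x00f8fcf8. Function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RGB565 -> XRGB8888 for a single pixel. */
static uint32_t rgb565_to_xrgb8888(uint16_t p)
{
    uint32_t v = p;
    uint32_t c = ((v << 8) & 0x00f80000u)    /* R: bits 15..11 -> 23..19 */
               | ((v << 5) & 0x0000fc00u)    /* G: bits 10..5  -> 15..10 */
               | ((v << 3) & 0x000000f8u);   /* B: bits 4..0   -> 7..3   */

    /* Optional: replicate the high bits of each channel into its low bits. */
    c |= (c >> 5) & 0x00070007u;             /* R and B: top 3 bits */
    c |= (c >> 6) & 0x00000300u;             /* G: top 2 bits */
    return c;
}

/* Convert a run of n pixels. */
static void convert_run(const uint16_t *src, uint32_t *dst, size_t n)
{
    assert(src != 0 && dst != 0);
    for (size_t i = 0; i < n; i++)
        dst[i] = rgb565_to_xrgb8888(src[i]);
}
```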

Quote
I don't know, just tell me where I should be posting. :)
If this takes off, I may move it to the Tower and call it 'prose'...

Quote
OK, I'm not familiar with the games referenced in the original comment. I'm guessing they do all of the rendering on the CPU, then?
I don't really know, and to be honest I don't know enough about DirectX to know how feasible this would be, so at the moment it's more an intellectual exercise.  My basic concept (and I don't know if it's possible) was to hack Thief so that it creates a 16-bit software surface and then insert some kind of hook to new code so that it copies and converts the framebuffer to a hardware 32-bit surface.
This sort of thing has been done before, e.g. the Morrowind Graphics Extender, so it's not wholly impossible.

J.P. Morris, Chief Engineer DMFA Radio Project * IT-HE * D-T-E


RJ

As far as I know, I'm not losing at "Survive Living in Australia". Now that's a fun game.  :)

Fibre

#47
Quote from: Tapewolf on July 30, 2009, 04:26:53 AM
I kind of dropped out of the assembler game after MMX, however if it is possible to do 64-bit shifts and ANDs, that would allow you to do two pixels per iteration.

Quote
I don't see anything directly supporting RGB565 in MMX/SSE, but it looks quite doable by using the regular SSE instructions to run at least 4-8 iterations in parallel.
I will have to look into that.

The algorithm I am thinking of, in rough pseudo-code (hopefully the vector notation makes sense), is:

for(unsigned x = 0; x < pitch; x += 8)
{
v8i16 s = *(v8i16 const *)(src_row + x);
v4i32 s0 = <(i32)s[0], (i32)s[1], (i32)s[2], (i32)s[3]>;
v4i32 s1 = <(i32)s[4], (i32)s[5], (i32)s[6], (i32)s[7]>;

v4i32 r0 = (s0 << rshift) & rmask;
v4i32 r1 = (s1 << rshift) & rmask;

v4i32 g0 = (s0 << gshift) & gmask;
v4i32 g1 = (s1 << gshift) & gmask;

v4i32 b0 = (s0 << bshift) & bmask;
v4i32 b1 = (s1 << bshift) & bmask;

*(v4i32 *)(dest_row + x + 0) = r0 | g0 | b0;
*(v4i32 *)(dest_row + x + 4) = r1 | g1 | b1;
}


All of these vector operations appear to be supported by at least SSE3, although I haven't mapped out the exact instructions/intrinsics yet. The unpacking/expansion at the beginning may need a few tricks but looks possible. Of course, when lowered this code will need to be intelligently scheduled, software-pipelined, and perhaps unrolled a bit to make full use of the hardware.
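
For a compilable version of the pseudo-code, one option is the generic vector extension available in GCC and Clang rather than explicit SSE intrinsics. The sketch below makes that assumption, targets XRGB8888 output, and leaves instruction selection to the compiler, so it shows the shape of the loop rather than a tuned kernel. The type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Four 32-bit lanes in one 16-byte vector (GCC/Clang extension). */
typedef uint32_t v4u32 __attribute__((vector_size(16)));

/* Convert `width` RGB565 pixels to XRGB8888, eight pixels per iteration.
   width must be a multiple of 8 here; a real version would handle the
   tail or rely on row padding. */
static void convert_row_vec(const uint16_t *src, uint32_t *dst, size_t width)
{
    const v4u32 rshift = {8, 8, 8, 8};
    const v4u32 gshift = {5, 5, 5, 5};
    const v4u32 bshift = {3, 3, 3, 3};
    const v4u32 rmask = {0x00f80000u, 0x00f80000u, 0x00f80000u, 0x00f80000u};
    const v4u32 gmask = {0x0000fc00u, 0x0000fc00u, 0x0000fc00u, 0x0000fc00u};
    const v4u32 bmask = {0x000000f8u, 0x000000f8u, 0x000000f8u, 0x000000f8u};

    assert(width % 8 == 0);
    for (size_t x = 0; x < width; x += 8) {
        /* Widen eight 16-bit pixels into two vectors of four 32-bit lanes. */
        v4u32 s0 = {src[x + 0], src[x + 1], src[x + 2], src[x + 3]};
        v4u32 s1 = {src[x + 4], src[x + 5], src[x + 6], src[x + 7]};

        v4u32 o0 = ((s0 << rshift) & rmask)
                 | ((s0 << gshift) & gmask)
                 | ((s0 << bshift) & bmask);
        v4u32 o1 = ((s1 << rshift) & rmask)
                 | ((s1 << gshift) & gmask)
                 | ((s1 << bshift) & bmask);

        /* memcpy avoids assuming that dst is 16-byte aligned. */
        memcpy(dst + x,     &o0, sizeof o0);
        memcpy(dst + x + 4, &o1, sizeof o1);
    }
}
```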

Quote
Quote
I'm assuming that we have pretty much full control over the image formats...
At the end of the day, it's most likely 565->0888.

Well, more than the exact pixel format, it would be useful to guarantee the alignment of rows as well as the legality of reading and writing the padding after each row. The above code, as is, requires 16-byte row alignment for efficiency, and must be able to read the padding in the source and write it in the destination if the width is not a multiple of 8 pixels.
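
One way to get those guarantees is to pick the row pitch up front. A minimal sketch, assuming rows are padded to a whole number of 8-pixel groups and then to a 16-byte multiple; padded_pitch and row_pitch are illustrative names, and the buffer itself would still need to come from an allocator that returns 16-byte-aligned memory.

```c
#include <assert.h>
#include <stddef.h>

/* Round a byte count up to the next multiple of `align` (a power of two). */
static size_t padded_pitch(size_t row_bytes, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (row_bytes + align - 1) & ~(align - 1);
}

/* Bytes per row so that an 8-pixel-wide kernel can read or write past the
   visible width without leaving the row, and every row starts on a
   16-byte boundary (given a 16-byte-aligned base pointer). */
static size_t row_pitch(size_t width, size_t bytes_per_pixel)
{
    size_t padded_width = (width + 7) & ~(size_t)7;
    return padded_pitch(padded_width * bytes_per_pixel, 16);
}
```

For a 1021-pixel-wide image this gives 2048 bytes per row at 16bpp and 4096 at 32bpp, the same as a 1024-pixel-wide one.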

Janus Whitefurr

Quote from: RJ on July 30, 2009, 05:01:51 AM
As far as I know, I'm not losing at "Survive Living in Australia". Now that's a fun game.  :)

Except for the drop-bears.
This post has been brought to you by Bond. Janus Bond. And the Agency™. And possibly spy cameras.

M


Kenji

Quote from: Marmonstein on July 30, 2009, 02:01:57 AM
Hmm, my favorite games? They would probably be these:

Okami
Paper Mario (both original and TTYD)
Klonoa 2
Tail Concerto
Phoenix Wright
Cooking Mama
DDR
Persona 4
Okage
Harvest Moon (ALL OF THEM)
Pokemon
Majora's Mask
Ocarina of Time
Twilight Princess
Final Fantasy IX
Professor Layton and the Curious Village (Can't wait for Diabolical Box)

There are probably more, but I can't think of them. X3

Son of a... how did I forget Tail Concerto...

M

Quote from: Kenji on July 30, 2009, 01:37:51 PM
Son of a... how did I forget Tail Concerto...

I can't believe that there's someone else who knows about that game besides my sister and me. XD

Vidar

Quote from: Marmonstein on July 30, 2009, 12:28:52 PM
Quote from: Janus Whitefurr on July 30, 2009, 08:40:53 AM
Except for the drop-bears.

Don't forget the hoop snakes.

Not to mention the spiders, crocodiles, scorpions, snakes, insects, grass, and everything else that wants to kill you. Why would anyone want to live in Oz anyway?

Also, Aquaria: buy this game. It's really pretty, and you get to kick the butts of a couple of gods while you explore the place.
\^.^/ \O.O/ \¬.¬/ \O.^/ \o.o/ \-.-/' \O.o/ \0.0/ \>.</

llearch n'n'daCorna

Quote from: Vidar on July 31, 2009, 02:06:15 AM
Not to mention the spiders, crocodiles, scorpions, snakes, insects, grass, and everything else that wants to kill you. Why would anyone want to live in Oz anyway?

Some of the sheep.
Thanks for all the images | Unofficial DMFA IRC server
"We found Scientology!" -- The Bad Idea Bears

Tapewolf

Quote from: llearch n'n'daCorna on July 31, 2009, 06:32:35 AM
Some of the sheep.

Do you mean that the sheep are one of the reasons to live there, one of the many perils or both?  (Now I'm imagining some kind of sheep succubus or something)

J.P. Morris, Chief Engineer DMFA Radio Project * IT-HE * D-T-E


Janus Whitefurr

Quote from: Vidar on July 31, 2009, 02:06:15 AM
Not to mention the spiders, crocodiles, scorpions, snakes, insects, grass, and everything else that wants to kill you. Why would anyone want to live in Oz anyway?

External opinion seems to be that you ALSO get a drop-dead gorgeous accent that turns foreigners (usually of the US variety) into mush just hearing it.

And besides, it's not like we -wanted- to live here. We just adapted after Britain dumped us.
This post has been brought to you by Bond. Janus Bond. And the Agency™. And possibly spy cameras.

Kipiru

I just thought of another underappreciated game I love: Kiss Psycho Circus: The Nightmare Child. It was a really fun game with a unique atmosphere and characters. The only game I've played that really felt like you were in a wacky dream.

hapless

#57
Quote from: Tapewolf on July 31, 2009, 06:38:39 AM
Quote from: llearch n'n'daCorna on July 31, 2009, 06:32:35 AM
Some of the sheep.

Do you mean that the sheep are one of the reasons to live there, one of the many perils or both?  (Now I'm imagining some kind of sheep succubus or something)

Quote from:  "The Last Continent" by Terry Pratchett
Death held out a hand, I WANT, he said, A BOOK ABOUT THE DANGEROUS
CREATURES OF FOURECKS -

Albert looked up and dived for cover, receiving only mild bruising
because he had the foresight to curl into a ball.

After a while Death, his voice a little muffled, said: ALBERT, I WOULD
BE SO GRATEFUL IF YOU COULD GIVE ME A HAND HERE.

Albert scrambled up and pulled at some of the huge volumes, finally
dislodging enough of them to allow his master to clamber free.

HMM . . . Death picked up a book at random and read the cover.

DANGEROUS MAMMALS, REPTILES, AMPHIBIANS, BIRDS, FISH, JELLYFISH,
INSECTS, SPIDERS, CRUSTACEANS, GRASSES, TREES, MOSSES, AND LICHENS OF
TERROR INCOGNITA, he read. His gaze moved down the spine. VOLUME 29C, he
added. OH. PART THREE, I SEE.

He glanced up at the listening shelves. POSSIBLY IT WOULD BE SIMPLER IF
I ASKED FOR A LIST OF THE HARMLESS CREATURES OF THE AFORESAID CONTINENT?

They waited.

IT WOULD APPEAR THAT -

'No, wait, master. Here it comes.'

Albert pointed to something white zigzagging lazily through the air.
Finally Death reached up and caught the single sheet of paper.

He read it carefully and then turned it over briefly just in case
anything was written on the other side.

'May I?' said Albert. Death handed him the paper.

' "Some of the sheep," ' Albert read aloud. 'Oh, well. Maybe a week at
the seaside'd be better, then.'
Chaosnet device not responding - check breaker on the Unibus

llearch n'n'daCorna

Quote from: Janus Whitefurr on July 31, 2009, 06:49:34 AM
External opinion seems to be that you ALSO get a drop-dead gorgeous accent that turns foreigners (usually of the US variety) into mush just hearing it.

... and an inability to pronounce "six" correctly. ;-]

Quote from: Tapewolf on July 31, 2009, 06:38:39 AM
Do you mean that the sheep are one of the reasons to live there, one of the many perils or both?  (Now I'm imagining some kind of sheep succubus or something)

Perhaps.
Thanks for all the images | Unofficial DMFA IRC server
"We found Scientology!" -- The Bad Idea Bears

Tapewolf

Quote from: Janus Whitefurr on July 31, 2009, 06:49:34 AM
And besides, it's not like we -wanted- to live here. We just adapted after Britain dumped us.

I think it was a case of 'go there or be hanged'  :B

J.P. Morris, Chief Engineer DMFA Radio Project * IT-HE * D-T-E