Hey there. I'm an intermediate coder with a decent mud. I've learned to code in a sort of patchwork way, dealing in specifics instead of the blanket learning most schooled people end up with. Enough about myself, though; let's get to my problem.
When I have a copyover, it sometimes hangs. Cores and backtraces taken while running with gdb attached show that the hang happens when it attempts to write to a descriptor in copyover_recover. That was my first clue to the problem. Perhaps it is trying to write to a file descriptor thinking it's a socket? Looking at the copyover code I noticed that it inherits all descriptors, and I do mean ALL, which leads me to believe open file descriptors may be tagging along.
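To illustrate the inheritance point: assuming the copyover re-executes the binary with an exec-family call (as the usual copyover snippets do), any descriptor left open at that moment survives into the new process unless it is flagged close-on-exec. A minimal sketch of flagging one descriptor follows. The helper name set_close_on_exec is made up for illustration and is not part of any codebase.

```c
#include <fcntl.h>

/* Hypothetical helper (not from the mud codebase): flag a descriptor as
 * close-on-exec so the kernel closes it automatically when the process
 * calls exec. Returns 0 on success, -1 on error. */
int set_close_on_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);          /* read current fd flags */

    if (flags == -1)
        return -1;                            /* bad descriptor */

    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1)
        return -1;

    return 0;
}
```

For a stream opened with fopen(), the descriptor would be fileno(fp). This should only be applied to descriptors that are NOT meant to survive the copyover; player sockets have to stay inheritable or the players get dropped.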
I went into /proc/<proc_id>/fd and did "ls -l", noticing pipes and sockets. So, becoming more interested, I ran lsof on the process. The list I was given was a mix of descriptors whose protocol it says it can't identify (I figure these are the tcp/ip stacks), and a bunch called FIFO. Reasoning led me to assume the acronym means FILE IN FILE OUT, but I could be wrong. To me this appears to be a descriptor that was opened, likely with fopen(), and is now sort of hanging out there.
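The same /proc listing can be read from inside the process, which would let the mud log its own descriptors (say, from an admin command) instead of needing a shell. Here is a rough, Linux-only sketch. The function name count_fds_matching is hypothetical, and it assumes /proc is mounted.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: walks /proc/self/fd, optionally prints each
 * descriptor and the target of its symlink to `out` (pass NULL to stay
 * quiet), and returns how many targets start with `prefix`, for example
 * "pipe:" or "socket:". Returns -1 if /proc/self/fd cannot be opened. */
int count_fds_matching(const char *prefix, FILE *out)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *ent;
    int matches = 0;

    if (dir == NULL)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        char link[288];
        char target[256];
        ssize_t len;

        if (ent->d_name[0] == '.')            /* skip "." and ".." */
            continue;

        snprintf(link, sizeof link, "/proc/self/fd/%s", ent->d_name);
        len = readlink(link, target, sizeof target - 1);
        if (len < 0)
            continue;
        target[len] = '\0';                   /* readlink does not terminate */

        if (out != NULL)
            fprintf(out, "fd %s -> %s\n", ent->d_name, target);

        if (strncmp(target, prefix, strlen(prefix)) == 0)
            matches++;
    }

    closedir(dir);
    return matches;
}
```

Calling count_fds_matching("pipe:", stderr) at intervals and comparing the result against the player count would show whether the pipe entries are the ones that keep growing. The listing also includes the descriptor opendir() itself uses, which points at the fd directory and can be ignored.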
Under normal circumstances I think I could get away with it, but I have a lot of activity on my mud. Not sure if it's just a byproduct or a direct result, but over time the gap between my total users (in do_users) and my fds increases. I have checked out my most recent additions to the mud, which do include writing to files, but these have matching fclose() statements. The big problem is that this causes the mud to hang, and if I'm not there to kill the process it stays hung until I come around. This just isn't good, as I work a full-time job.
In a perfect world, lsof would have shown me the path of the file the descriptor points to. In this world, however, I'm left searching for the elusive answer to this annoying problem. My big question is: what are my options when it comes to this sort of thing?
I've read around at some of the other threads on this forum and I've noticed people having somewhat of a similar problem. I've tried the "ls -la", but procview isn't really showing me the source of the problem. I know it was mentioned at least once that the only recourse would be to dig through the files and find every instance of fopen() and ensure there is a matching fclose().
After this has been posted I'll get to digging through each of my files individually. My codebase is highly modified, and this'll be a pretty big undertaking. I was thinking perhaps I could redefine fopen? Just pop in a redeclaration of it in each file using fopen for the purpose of debugging, sort of like how KEY is done with fMatch. Have it write out to another file (oh sweet irony) for each one that I fopen() and have it write if it executes an fclose()? Then I could look in /file, for instance, and check which file lacks a closing comment? I don't really know how to redefine it though.
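A rough sketch of that redefinition idea, for debugging only. Everything named debug_* and MAX_TRACKED here is invented for illustration. Instead of writing to a log file, it keeps a small in-memory table of streams that were opened but not yet closed, along with the source file and line that opened them. It assumes the two macros at the bottom live in a header included by every file that calls fopen(), while the functions sit in one .c file that does not see the macros.

```c
#include <stdio.h>

#define MAX_TRACKED 1024   /* arbitrary table size, an assumption */

struct tracked_file {
    FILE *fp;              /* NULL means the slot is free */
    const char *src_file;  /* __FILE__ of the caller */
    int src_line;          /* __LINE__ of the caller */
    char path[256];        /* path that was opened */
};

static struct tracked_file tracked[MAX_TRACKED];

/* Opens the file with the real fopen() and remembers who asked. */
FILE *debug_fopen(const char *path, const char *mode,
                  const char *src_file, int src_line)
{
    FILE *fp = fopen(path, mode);
    int i;

    if (fp == NULL)
        return NULL;

    for (i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].fp == NULL) {
            tracked[i].fp = fp;
            tracked[i].src_file = src_file;
            tracked[i].src_line = src_line;
            snprintf(tracked[i].path, sizeof tracked[i].path, "%s", path);
            break;
        }
    }
    return fp;
}

/* Forgets the stream, then closes it with the real fclose(). */
int debug_fclose(FILE *fp)
{
    int i;

    for (i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].fp == fp) {
            tracked[i].fp = NULL;
            break;
        }
    }
    return fclose(fp);
}

/* Prints every stream still open and returns how many there are. */
int debug_report_open(FILE *out)
{
    int i, count = 0;

    for (i = 0; i < MAX_TRACKED; i++) {
        if (tracked[i].fp != NULL) {
            fprintf(out, "still open: %s (opened at %s:%d)\n",
                    tracked[i].path, tracked[i].src_file,
                    tracked[i].src_line);
            count++;
        }
    }
    return count;
}

/* Debug-only redirection. Redefining standard library names is not
 * strictly sanctioned by the C standard, so treat this as a temporary
 * hack and remove it once the leak is found. */
#undef fopen
#undef fclose
#define fopen(path, mode) debug_fopen((path), (mode), __FILE__, __LINE__)
#define fclose(fp)        debug_fclose(fp)
```

Calling debug_report_open() just before the copyover would list whatever is still open and where it came from. Note that this only sees streams opened through fopen(); descriptors created by popen(), pipe(), or socket() never pass through it and would need wrappers of their own.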
The biggest problem here is that it doesn't seem to crop up on my coding port, where there's just me. This problem seems to happen when my 60+ players are on, and I can't figure out if this is the product of sockets not being closed or files not being closed.
When I do "lsof -p <pid>" I get a mix of things. Aside from the expected libraries, I get these two listings:
<exec name> 9881 <username> 75u sock 0,0 14803045 can't identify protocol
as well as
<exec name> 9881 <username> 80r FIFO 0,5 14833858 pipe
The second one, the pipe, is what I think the problem is. I base this on the fact that the gap between my connections and my FDs matches the number of entries like the latter more closely than the former. Here's procview:
lrwx------ 1 <username> procview 64 Mar 15 19:30 8 -> socket:[14760085]
and the pipe
lr-x------ 1 <username> procview 64 Mar 15 19:30 83 -> pipe:[14845918]
I have IMC installed on my mud. I noticed that in the IMC code it opens something with fopen() and closes it with IMCFCLOSE(). I wonder if this has any significance? It was written by coders far superior to me, so I doubt that the blame for my situation lies on anyone's shoulders but my own. I am, however, given to investigation. At times I've seen the FDs as high as 256, much higher than my player count.
Not sure if it's related, but it does seem to happen every time after I try to run a manual pfile scan with Samson's pfile code. Likely my 12,000+ files within /players might cause the code itself to crash. When my mud crashes it tries to copyover instead (intentional), and it hangs immediately. In fact, most of the hang-ups seem to happen when it's a crashover instead of a manual copyover, but it has happened with manual copyovers nonetheless.
I'm posting this anonymously because there are people who seem to dedicate so much of their time to being pests. If they figured out some way to crash my mud on demand I'd be in serious trouble until the problem was fixed. My post was this long simply because I hope to give adequate information to have some of the fine minds here recognize the nature or cause of my problem. Thank you for reading, have a good day.