Gammon Forum

Problem with files, I think...

It is now over 60 days since the last post. This thread is closed.


Posted by Justacoder   (3 posts)
Date Wed 16 Mar 2005 01:48 AM (UTC)
Message
Hey there. I'm an intermediate coder with a decent mud. I've learned to code in a sort of patchwork way, dealing in specifics instead of the blanket learning most schooled people end up with. Enough about myself, though, let's get to my problem.

When I have a copyover, it sometimes hangs. Cores and backtraces from running with gdb attached show that the hang happens when it's attempting to write to a descriptor in copyover_recover. That was my first clue to the problem. Perhaps it is trying to write to a file descriptor thinking it's a socket? Looking at the copyover code I noticed that it inherits all descriptors, and I do mean ALL, which leads me to believe open file descriptors may be tagging along.

I went into /proc/<proc_id>/fd and did "ls -l", noticing pipes and sockets. So, becoming more interested, I ran lsof on the process. The list I was given was a mix of descriptors it says it can't identify the protocol of (I figure these are the tcp/ip stacks), and a bunch called FIFO. Reasoning led me to assume the acronym means FILE IN FILE OUT, but I could be wrong. To me this appears to be a descriptor opened, likely with fopen(), which is sort of hanging out there.

Under normal circumstances I think I could get away with it, but I have a lot of activity on my mud. Not sure if it's just a byproduct or a direct result, but over time the gap between my total users (in do_users) and my fds increases. I have checked out my most recent additions to the mud, which do include writing to files, but these have matching fclose() statements. The big problem with this is that it will cause the mud to hang, and therefore if I'm not there to kill the process it'll hang until I come around. This just isn't good, as I work a full-time job.

In a perfect world, lsof would have shown me the path of the file the descriptor points to. In this world, however, I'm left searching for the elusive answer to this annoying problem. My big question is: what are my options when it comes to this sort of thing?

I've read around at some of the other threads on this forum and I've noticed people having somewhat of a similar problem. I've tried the "ls -la", but procview isn't really showing me the source of the problem. I know it was mentioned at least once that the only recourse would be to dig through the files and find every instance of fopen() and ensure there is a matching fclose().

After this has been posted I'll get to digging through each of my files individually. My codebase is highly modified, and this'll be a pretty big undertaking. I was thinking perhaps I could redefine fopen? Just pop in a redeclaration of it in each file using fopen for the purpose of debugging, sort of like how KEY is done with fMatch. Have it write out to another file (oh sweet irony) for each one that I fopen() and have it write if it executes an fclose()? Then I could look in /file, for instance, and check which file lacks a closing comment? I don't really know how to redefine it though.
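
A minimal sketch of the redefinition idea, assuming the wrappers and macros live in a header that every source file already includes. The names dbg_fopen, dbg_fclose and open_streams are hypothetical, and the log goes to stderr rather than to another file so the tracer itself opens nothing. Redefining standard library names like this is only meant for a temporary debug build.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() */
#include <stdio.h>

/* Hypothetical counter: streams opened through the wrapper and not yet closed. */
static int open_streams = 0;

/* Open a file and log the descriptor number plus the call site. */
FILE *dbg_fopen(const char *path, const char *mode,
                const char *src, int line)
{
    FILE *fp = fopen(path, mode);

    if (fp != NULL) {
        open_streams++;
        fprintf(stderr, "OPEN  fd=%d %s (%s:%d)\n",
                fileno(fp), path, src, line);
    }
    return fp;
}

/* Close a stream and log which call site closed which descriptor. */
int dbg_fclose(FILE *fp, const char *src, int line)
{
    if (fp == NULL)
        return EOF;
    fprintf(stderr, "CLOSE fd=%d (%s:%d)\n", fileno(fp), src, line);
    open_streams--;
    return fclose(fp);
}

/* Defined after the wrappers, so the wrappers themselves call the real
 * functions while every later fopen()/fclose() in the file is traced. */
#define fopen(path, mode) dbg_fopen((path), (mode), __FILE__, __LINE__)
#define fclose(fp)        dbg_fclose((fp), __FILE__, __LINE__)
```

An OPEN line whose fd number never shows up in a later CLOSE line points at the call site that leaks, and the fd number can be compared against the entries in /proc/<proc_id>/fd. Note that this only traces fopen(); descriptors created by open(), pipe() or popen() would not appear in the log.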

The biggest problem here is that it doesn't seem to crop up on my coding port, where there's just me. This problem seems to happen when my 60+ players are on, and I can't figure out if this is the product of sockets not being closed or files not being closed.

When I do "lsof -p <pid>" I get a mix of things. Aside from the expected libraries, I get these two listings:

<exec name> 9881 <username> 75u sock 0,0 14803045 can't identify protocol

as well as

<exec name> 9881 <username> 80r FIFO 0,5 14833858 pipe

The second one, the pipe, is what I think the problem is. I base this on the fact that the gap between my connections and FDs matches the number of the latter kind more closely than the former. Here's procview:

lrwx------ 1 <username> procview 64 Mar 15 19:30 8 -> socket:[14760085]

and the pipe

lr-x------ 1 <username> procview 64 Mar 15 19:30 83 -> pipe:[14845918]
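
The same classification can be done from inside the process with fstat(). In lsof output, FIFO conventionally stands for "first in, first out", the kernel's name for a pipe, which is the kind of object pipe() and popen() create; a plain fopen() on a regular file would show up with its path instead. The following is a rough sketch, assuming Linux (it reads /proc/self/fd); fd_kind and report_fds are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/* Describe what kind of object a descriptor refers to. */
const char *fd_kind(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return "closed";
    if (S_ISFIFO(st.st_mode))
        return "fifo";
    if (S_ISSOCK(st.st_mode))
        return "socket";
    if (S_ISREG(st.st_mode))
        return "file";
    return "other";
}

/* Print every open descriptor of this process to 'out'.
 * Returns the number of FIFOs seen, or -1 if /proc is unavailable. */
int report_fds(FILE *out)
{
    DIR *dir = opendir("/proc/self/fd");
    struct dirent *ent;
    int fifos = 0;

    if (dir == NULL)
        return -1;
    while ((ent = readdir(dir)) != NULL) {
        char *end;
        long fd = strtol(ent->d_name, &end, 10);
        const char *kind;

        if (end == ent->d_name || *end != '\0')
            continue;               /* skip "." and ".." */
        if (fd == dirfd(dir))
            continue;               /* skip the directory handle itself */
        kind = fd_kind((int) fd);
        fprintf(out, "fd %ld: %s\n", fd, kind);
        if (strcmp(kind, "fifo") == 0)
            fifos++;
    }
    closedir(dir);
    return fifos;
}
```

Calling report_fds() from an immortal-only command or at the top of the copyover code would show whether the FIFO count grows between copyovers.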

I have IMC installed on my mud. I noticed that in the IMC code it opens something with fopen() and closes it with IMCFCLOSE(). I wonder if this has any significance? It was written by far superior coders than I, so I doubt that the blame for my situation lies on anyone's shoulders but my own. I am, however, given to investigation. At times I've seen the FDs as high as 256, much higher than my player count.

Not sure if it's related, but it does seem to happen every time after I try to run a manual pfile scan with Samson's pfile code. Likely my 12,000+ files within /players might cause the code itself to crash. When my mud crashes it tries to copyover instead (intentional), and it hangs immediately. In fact, most of the hang-ups seem to happen when it's a crashover instead of a manual copyover, but it has happened with manual copyovers nonetheless.

I'm posting this anonymously because there are people who seem to dedicate so much of their time to being pests. If they figured out some way to crash my mud on demand I'd be in serious trouble until the problem was fixed. My post was this long simply because I hope to give adequate information to have some of the fine minds here recognize the nature or cause of my problem. Thank you for reading, have a good day.

Posted by Samson   USA  (683 posts)
Date Reply #1 on Wed 16 Mar 2005 02:39 AM (UTC)
Message
Problems like this are often quite difficult to track down, but since you're familiar with the /proc/pid/fd method, do you have anything in there that looks similar to:

lr-x------ 1 samson samson 64 Mar 15 19:27 6 -> /home/samson/BlitzRom24/imc/imc2.channels

The above, taken from a Rom2.4 I have running, is telling me the imc2.channels file is still hanging open. ( incidentally, didn't realize this, not my code :) ) Anyway. You didn't mention if you had any dangling files like this. Dangling files like this will duplicate over time, but it would take forever before it would hang the mud.

The IMCFCLOSE macro you noticed is just a wrapper for the following 2 statements:

fclose(fp);
fp = NULL;

12,000 pfiles is a lot to process. It wouldn't surprise me in the least if the pfiles snippet folded somehow, but the version I have here has the proper amount of fopen/fclose matchups. Even if it sucked down all your resources for a few minutes, I don't see how it could avoid closing all the files it opened, unless like you say, it crashes and runs a copyover to recover from it. That would leave a whole lot of leaky file descriptors lying around. Descriptors that would probably not close until you did a full reboot. Such descriptors would float along from copyover to copyover and be a real hassle to squash. Stuff like this is why I'm not a big fan of copyover being used as a "crash recovery" tool :)
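
One way to stop data files from floating along, sketched here under the assumption that copyover restarts the server with an exec-family call: mark every descriptor that is not a player socket as close-on-exec, so the kernel closes it at the exec even if the code was interrupted before its fclose(). The helper names set_cloexec and fopen_cloexec are hypothetical. Player sockets must not get this flag, since copyover depends on inheriting them.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>

/* Ask the kernel to close 'fd' automatically when the process calls exec*().
 * Returns 0 on success, -1 on error. */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}

/* fopen() variant whose descriptor will not survive an exec. */
FILE *fopen_cloexec(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);

    if (fp != NULL && set_cloexec(fileno(fp)) == -1) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

This does not clean up descriptors that have already leaked into a running process; those still need a full reboot or an explicit close().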

Posted by Greven   Canada  (835 posts)
Date Reply #2 on Wed 16 Mar 2005 03:17 AM (UTC)
Message
I've had many problems with copyover hanging, specifically when I add MCCP and client detection to a mud. Check the copyover file, because if you have an extra word or entry in any of the lines for the characters, it will hang. One problem I had was when one specific player was on, he changed the name that his client output to something that had a space in it, like "Marowi rules!". When it wrote it to the file, it would write the space, and then when it tried to read that line in, it would hang there because when it called fscanf in a fashion such as this:
                fscanf(fp, "%d %d %d %d %d %s %s %s \n", &desc, &room,
                       &bCompress, &msp, &mxp, name, host, client);
It would have an extra word before the \n was there, and it would not go to the next line of the file. Check that when you have this problem; to figure it out I just removed the call to unlink for a short time, so the copyover file stayed around to look at. Want to make sure you put it back, though :)
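
A more forgiving way to read such a file, as a sketch only: pull in one whole line with fgets() and parse it with sscanf(), so a stray extra word can never push the reader out of step with the line boundaries. The field order follows the fscanf call above; the struct, the buffer sizes and the function names are assumptions, not the actual copyover code.

```c
#include <stdio.h>

/* Hypothetical record for one line of the copyover file. */
struct copyover_entry {
    int desc, room, compress, msp, mxp;
    char name[64], host[128], client[64];
};

/* Parse a single line. Returns 1 if all eight fields were read, else 0.
 * Field widths keep an overlong word from overflowing the buffers. */
int parse_copyover_line(const char *line, struct copyover_entry *e)
{
    int n = sscanf(line, "%d %d %d %d %d %63s %127s %63s",
                   &e->desc, &e->room, &e->compress, &e->msp, &e->mxp,
                   e->name, e->host, e->client);
    return n == 8;
}

/* Read up to 'max' entries, one line at a time. Anything after the eighth
 * field is discarded with its line; lines that do not parse are skipped. */
int read_copyover_file(FILE *fp, struct copyover_entry *out, int max)
{
    char line[512];
    int count = 0;

    while (count < max && fgets(line, sizeof line, fp) != NULL) {
        if (parse_copyover_line(line, &out[count]))
            count++;
    }
    return count;
}
```

With this approach a client name of "Marowi rules!" is read as "Marowi" and the rest of the line is dropped, instead of the loop stalling on it.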

Nobody ever expects the spanish inquisition!

darkwarriors.net:4848
http://darkwarriors.net

Posted by Justacoder   (3 posts)
Date Reply #3 on Thu 17 Mar 2005 02:22 AM (UTC)

Amended on Thu 17 Mar 2005 02:25 AM (UTC) by Justacoder

Message
Hey guys, thanks for the help. I'm halfway there, I've figured out the problem with the copyovers.

My codebase has MCCP. Certain clients such as MUSHclient allow you to set the name of your terminal. This can be problematic if someone supplies a string comprised of a space, or a strange character. In order to combat this I broke it up into two halves. First I made it check for the string length when it's assigning their client name on the descriptor_data. If too low, it replaces whatever name with "[unknown]" instead. Next I filter the client name when it specifies it in copyover.

Basically it just runs through the string character by character weeding out spaces (redundant safety) and foreign characters. I urge anyone who reads this to examine their own code, as I believe there is a good chance this was done purposefully by a player to my mud to slow us down*.
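
A minimal sketch of such a filter, assuming the client name arrives as a NUL-terminated string. The function name, the min_len threshold and the buffer handling are hypothetical; only the "[unknown]" fallback comes from the description above.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy 'src' into 'dst', keeping only printable, non-space ASCII.
 * If fewer than 'min_len' characters survive, store "[unknown]" instead. */
void sanitize_client_name(char *dst, size_t dst_size,
                          const char *src, size_t min_len)
{
    size_t n = 0;

    if (dst_size == 0)
        return;
    for (; *src != '\0' && n + 1 < dst_size; src++) {
        unsigned char c = (unsigned char) *src;

        /* isgraph() rejects spaces and control characters; the < 128
         * check also drops bytes outside plain ASCII. */
        if (c < 128 && isgraph(c))
            dst[n++] = (char) c;
    }
    dst[n] = '\0';

    if (n < min_len) {
        strncpy(dst, "[unknown]", dst_size - 1);
        dst[dst_size - 1] = '\0';
    }
}
```

Running the same filter both when the name is first stored and again when the copyover file is written gives the two layers of protection described above.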

I'm left now with the second problem, my FIFO pipes that open as read only (a clue?). So I checked through my code to find any calls to open(), to find only two. One is in malloc.c and the other is in a code module that has been disabled. One thing I did notice, however, is that when I recompiled with IMC undefined I got rid of the first FIFO pipe which was opened before the NULL streams. We have two streams reserved, just to clarify.

Now on to the question. Is it possible that a function called in the code is opening a pipe and not closing it? And if that's the case, is there any way I can patch this (ugly patch, but a patch nonetheless) by searching through the FDs and closing the FIFO pipes? If nothing else this would show me where the FIFO pipe was being referenced by the crash.

One option I've considered is running system() with lsof searching for the executable name, outputting that to a file, opening the file, cycling through it and interpreting the lines to figure out the FD each FIFO is using. I could then close() them, but is there an easier way? Again, thanks for the read through :)
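
An easier route than parsing lsof output, as a rough sketch: the process can ask the kernel about each of its own descriptors with fstat() and close the ones that are FIFOs. The function name and the idea of passing a descriptor range are assumptions for illustration. This closes every pipe in the range, including any the mud legitimately uses, so it is a stopgap rather than a fix for whatever is opening them.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/stat.h>
#include <unistd.h>

/* Close every descriptor in [first_fd, last_fd] that refers to a FIFO.
 * Returns the number of descriptors closed. */
int close_stray_fifos(int first_fd, int last_fd)
{
    int closed = 0;

    for (int fd = first_fd; fd <= last_fd; fd++) {
        struct stat st;

        /* fstat() fails on descriptors that are not open; skip those. */
        if (fstat(fd, &st) == 0 && S_ISFIFO(st.st_mode)) {
            if (close(fd) == 0)
                closed++;
        }
    }
    return closed;
}
```

Starting the range at 3 leaves stdin, stdout and stderr alone; a reasonable upper bound is the process's descriptor limit.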

* - Edited to rephrase "Theres a good chance this was done purposefully." as it may have been interpreted as an attack on whomever wrote the MCCP code, which isn't the case

Posted by Samson   USA  (683 posts)
Date Reply #4 on Thu 17 Mar 2005 11:18 AM (UTC)
Message
Quote:

One thing I did notice, however, is that when I recompiled with IMC undefined I got rid of the first FIFO pipe which was opened before the NULL streams.


I am curious now. Which version of IMC does your particular codebase have? There should be no reason for the more recent versions to leave any kind of file handle open, and if they are, I'd like to be able to narrow it down and fix it.

Posted by Justacoder   (3 posts)
Date Reply #5 on Fri 18 Mar 2005 02:37 AM (UTC)
Message
IMC2 is the version, (c)2004.

This is likely the product of an error in installing it. I'll tell you why. Big problem with these pipes here, they're tying up my descriptors with dummies and when I hit my max limit (which I do, *sigh*) it dumps data to ./, which happens to be my area folder. I checked my quota today and noticed I had somehow gained around 500M. Something like that would get reported pretty quick, methinks, so maybe this is MCCP again.

The pain in the neck is that it continues after I undefined IMC and recompiled. It could very well be that it's all my fault and has nothing to do with you, Samson. Weakest link sort of thing ;)

I just can't seem to figure out what is actually calling open() in the first place. If anyone thinks of anything to do with this or possible related causes, please let me know. Until then I've written my pipe protector that lets me rest assured /area isn't overflowing with weird files.


Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.