Gammon Forum : MUSHclient : Python : Pickling

Pickling

Posted by

David Haley USA (3,881 posts) Bio

Date

Reply #30 on Mon 30 Apr 2007 09:53 PM (UTC)

Message

Quote:
I am a bit lost about what "pickling" is exactly, I thought you did that to vegetables, like onions.

I think it's another word for serialization. After all you pickle food items to conserve them, so I suppose "pickling" data would be a conservation process of some sort.

Personally I just prefer "saving" or "serializing"... :)

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Posted by

Shaun Biggs USA (644 posts) Bio

Date

Reply #31 on Mon 30 Apr 2007 11:54 PM (UTC)

Message

Quote:
But, because Linux does it wrong, and that is the widest used standard, it also does something even stupider and treats CR and LF as the "same" thing when found, thus double spacing everything in some applications that are not smart enough to strip out the redundant information.

Actually, most programs I've used just have a ^M at the end of the line when they don't strip out the extra line feed. Nearly any editor I use has settings to deal with a file as DOS, Unix, or Mac formats depending on what you want. Saves a lot of time and effort rather than making you convert the ends yourself.

Quote:
I am a bit lost about what "pickling" is exactly, I thought you did that to vegetables, like onions.

Pickle is a fairly standard serialization format for data in Python. It converts the data into a byte string, similar to the serialize function included with MUSHclient. It saves only the object's data, not the code of its class, so you can edit the code and unpickle the data back into an instance of the new class, allowing you to fix logic errors and such without data loss.

It is much easier to fight for one's ideals than to live up to them.

Posted by

Nick Gammon Australia (23,166 posts) Bio Forum Administrator

Date

Reply #32 on Tue 01 May 2007 01:37 AM (UTC)

Message

Quote:

But, because Linux does it wrong ...

I don't think you can say it is wrong, if you define end of a line as a linefeed. Then it is correct, by definition, which I suppose is what Unix does. It seems odd to me to have to use 2 characters to indicate end of line, because you have a whole heap of possible problems then.

If cr/lf (0x0D 0x0A) is end of line, then what is:

Linefeed on its own?
Carriage-return on its own?
lf/cr? (ie. 0x0A 0x0D)

These questions go away if you stick to a single byte.
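
As a small illustration of why these questions matter, here is how Python's own str.splitlines() answers them. This is only a sketch of one library's interpretation, not a statement about how MUSHclient or any operating system treats the same bytes; the sample strings are made up.

```python
# Each sample joins the words "one" and "two" with a different separator.
samples = {
    "cr/lf": "one\r\ntwo",
    "lf only": "one\ntwo",
    "cr only": "one\rtwo",
    "lf/cr": "one\n\rtwo",
}

# splitlines() treats \r\n, \n and \r each as a single line boundary.
results = {name: text.splitlines() for name, text in samples.items()}

for name, lines in results.items():
    print(name, lines)
# The first three give ['one', 'two']; "lf/cr" gives ['one', '', 'two'],
# because the \n ends one line and the \r then ends an empty one.
```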

There have been other schemes to indicate the end of a line - one operating system I worked on had a length indicator at the start of each line (that is, a 2-byte field that was the length of the line).
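
A length-prefixed scheme like that can be sketched as follows. The 2-byte length field comes from the description above; the big-endian byte order, the UTF-8 encoding and the function names are assumptions made for the example, not details of the operating system mentioned.

```python
import struct


def encode_lines(lines):
    """Store each line as a 2-byte length followed by the line's bytes."""
    out = bytearray()
    for line in lines:
        data = line.encode("utf-8")
        # ">H" is an unsigned 16-bit big-endian integer (assumed byte order).
        out += struct.pack(">H", len(data)) + data
    return bytes(out)


def decode_lines(blob):
    """Read back the lines written by encode_lines()."""
    lines = []
    pos = 0
    while pos < len(blob):
        (length,) = struct.unpack_from(">H", blob, pos)
        pos += 2
        lines.append(blob[pos:pos + length].decode("utf-8"))
        pos += length
    return lines


# No terminator is needed, so \r and \n can appear inside a line unharmed.
original = ["first line", "contains \r and \n inside", ""]
round_tripped = decode_lines(encode_lines(original))
```

The trade-off is that a line can be at most 65,535 bytes long, and the file is no longer readable in a plain text editor.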

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Worstje Netherlands (899 posts) Bio

Date Reply #33 on Tue 01 May 2007 02:49 AM (UTC)

Message

It looks like I resurrected an ancient thread with a fresh discussion. Wonderful!

Pickling is, as some have already mentioned, a way of storing data. In this particular instance, I am using it to save a dictionary of information about a number of players, where the values are tuples. I got it all working fine using the workarounds mentioned above; that isn't the issue.

However, I am not dealing with pure binary data here. Pickled objects are, in the current implementation, just strings. All characters I have been able to see so far have been ASCII characters (but I don't deal in Unicode). My 'beef' is that, regardless of the platform, there may be a very good reason we want to store just \r or just \n or maybe the precise combination. Pickling, serializing, saving, whatever you wish to call it, is one of those.

When you serialize objects, there cannot be any doubt about bytes being what they are. Removing or changing a single byte could ruin the functionality. As such, I can imagine that the Python library has decided to use strings with \n as line endings, given that it is the most practical solution.
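
To make the point about \n concrete, here is a minimal sketch using the text-oriented pickle protocol 0. The player data is invented for the example, and the snippet is written for a current Python 3 rather than the Python embedded in MUSHclient at the time.

```python
import pickle

# Hypothetical sample data: player name -> tuple of details.
players = {"Alice": (10, "warrior"), "Bob": (7, "mage")}

# Protocol 0 is the original, human-readable pickle format.
pickled = pickle.dumps(players, protocol=0)

# The pickled form uses bare \n as a separator and contains no \r,
# so anything that rewrites line endings changes the pickle itself.
has_newlines = b"\n" in pickled
has_carriage_returns = b"\r" in pickled

# An untouched pickle round-trips exactly, tuples included.
restored = pickle.loads(pickled)
```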

The following is a light bit of code I have written to deal with (un)pickling in MUSHclient, although I am quite afraid of what other changes could possibly break the pickled string.

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
#                     PICKLING IN A NUTSHELL                    #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

import codecs
import pickle

# 'world' is the object MUSHclient provides to scripts.

def GetPickleVariable(MUSHVariable, Default=None):
	# MUSH stores strings as UTF-8 and returns
	# them in that shape, too. On top of that, it
	# mutilates \n into \r\n when storing as XML, but
	# it won't remove the extra \r when loading again.
	# See: http://www.gammon.com.au/forum/bbshowpost.php?id=5410&page=1
	enc = codecs.getencoder('utf8')
	APickle = world.GetVariable(MUSHVariable)
	if APickle is not None:
		# Strip every \r, then encode to a byte string before unpickling.
		return pickle.loads(enc(APickle.replace("\r", ""))[0])
	else:
		return Default

def SetPickleVariable(MUSHVariable, PythonVariable):
	world.SetVariable(MUSHVariable, pickle.dumps(PythonVariable))

I am well aware I need to implement a catch for the exception loads() throws when the incoming data is not proper, but in essence, this works as far as I've been able to test it. Stuff saved using SetPickleVariable() seems to be loaded fine by GetPickleVariable() in all tests I did so far.
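
The missing catch could look something like the sketch below. The helper name is hypothetical, and it works on a byte string directly instead of calling world.GetVariable() so that it runs outside MUSHclient. It catches Exception broadly because unpickling damaged data can raise several different exception types, not only pickle.UnpicklingError.

```python
import pickle


def load_pickle_or_default(raw, default=None):
    """Unpickle raw bytes, falling back to default if that fails."""
    if raw is None:
        # Mirrors the "variable does not exist" case.
        return default
    try:
        return pickle.loads(raw)
    except Exception:
        # Corrupt or truncated data: UnpicklingError, EOFError,
        # ValueError and others are all possible here.
        return default


good = load_pickle_or_default(pickle.dumps({"Alice": (10, "warrior")}))
bad = load_pickle_or_default(b"definitely not a pickle", default={})
missing = load_pickle_or_default(None, default={})
```

Note that this only guards against damaged data. Unpickling data from a source you do not trust is unsafe regardless, since a crafted pickle can run code while loading.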

At the risk of going off on a theoretical rant, and I may be spoiled by the fact that Python is usually a very clean language, but isn't the usage of replace() to remove something -hoping- it won't destroy another part of the string rather crude? On top of that.. the help for SetVariable() and GetVariable() never state anything about changing the values to fit. They 'set' or 'retrieve' the 'value of a variable'.

Anyhow, I'm done with my reasoning in circles here. I promise not to write any more posts after coming back from a night out. :)

Posted by

Nick Gammon Australia (23,166 posts) Bio Forum Administrator

Date

Reply #34 on Tue 01 May 2007 03:55 AM (UTC)

Amended on Tue 01 May 2007 04:07 AM (UTC) by Nick Gammon

Message

Quote:

On top of that, it
# mutilates \n into \r\n when storing as XML, but
# it won't remove the extra \r when loading again.

This isn't strictly correct. It stores them OK, but changes them when reading back in.

Quote:

... but isn't the usage of replace() to remove something -hoping- it won't destroy another part of the string rather crude?

Yes, a bit crude. :)

Conceivably the save routine could convert \t to &#9; - then it would read back in without changing it, and still fix up tabs to spaces where they are outside a quoted value.

Quote:

On top of that.. the help for SetVariable() and GetVariable() never state anything about changing the values to fit.

They don't, apart from the proviso of the 0x00 value, which I didn't expect people to put into strings in the first place.

It is the *loading* that changes the values, the Get/Set work OK, internally.

Quote:

enc = codecs.getencoder('utf8')

I don't see how this will help. UTF-8 only encodes bytes with values >= 128 differently; it won't change \r, \n or \t.
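
This is easy to check in plain Python. The sketch below uses Python 3 string and bytes types, which differ from the Python 2 strings in the code being discussed, but the UTF-8 encoding rules are the same.

```python
# The three control characters in question are all below 128.
control_chars = "\r\n\t"
encoded_controls = control_chars.encode("utf-8")

# A character at or above 128, here e-acute (U+00E9), becomes two bytes.
encoded_accent = "\u00e9".encode("utf-8")

print(encoded_controls)     # identical to the input, byte for byte
print(len(encoded_accent))  # 2
```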

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by Nick Gammon Australia (23,166 posts) Bio Forum Administrator

Date Reply #35 on Tue 01 May 2007 04:06 AM (UTC)

Message

I have amended the help for GetVariable and SetVariable to clarify this situation:



Please note that the following characters will not be handled correctly:



* The byte value hex 00 (otherwise known as 0x00 or null). This is used as a string terminator in MUSHclient, and attempts to embed 0x00 values into variables will result in the variable being terminated at the 0x00 position.



* The "tab" character hex 09 (otherwise known as 0x09). You can use the tab character inside variables internally, however when variables are loaded from a plugin state file, or a MUSHclient world file, tabs are converted to spaces.



* Carriage-return (hex 0D or 0x0D) and line-feeds (hex 0A or 0x0A). You can use these internally however you like, however when they are read from a plugin state file, or a MUSHclient world file, carriage returns are dropped, and line-feeds are converted to the sequence 0x0D 0x0A (carriage-return followed by linefeed).
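
The load-time changes listed above can be imitated in a few lines, which is handy for testing whether a stored value would survive a save and reload. This is only a model built from the description in the help text, not MUSHclient's actual code; in particular, replacing each tab with a single space is an assumption, and the 0x00 truncation is not modelled.

```python
def simulate_reload(value: str) -> str:
    """Imitate what happens to a variable read back from a state file."""
    without_cr = value.replace("\r", "")         # carriage returns are dropped
    fixed = without_cr.replace("\n", "\r\n")     # line feeds become CR + LF
    return fixed.replace("\t", " ")              # tabs become spaces (one each, assumed)


original = "line one\nline two\twith tab\r"
reloaded = simulate_reload(original)

# A value survives only if it contains none of the affected characters.
survives = simulate_reload("plain text") == "plain text"
```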





If you are planning to use variables to store "binary" data - that is, data that can have all 256 possible character values, you are advised to convert them to base-64 encoded strings (using Base64Encode) when saving them, and convert them back (using Base64Decode) when loading them. Also be aware that only the Lua version of the base 64 encoding (utils.base64encode and utils.base64decode) will correctly handle the 0x00 value.
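
Applied to pickling, the same advice might look like the sketch below. It uses Python's standard base64 module rather than MUSHclient's Base64Encode and Base64Decode, and the two helper names are hypothetical. In a script, the encoded string is what would be passed to SetVariable and read back with GetVariable.

```python
import base64
import pickle


def encode_for_variable(obj) -> str:
    """Pickle obj and wrap the bytes in base-64 text."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_from_variable(text: str):
    """Reverse encode_for_variable()."""
    return pickle.loads(base64.b64decode(text))


players = {"Alice": (10, "warrior"), "Bob": (7, "mage")}
stored = encode_for_variable(players)

# Base-64 text contains none of the characters that get altered on load.
fragile = [ch for ch in stored if ch in "\r\n\t\x00"]

restored = decode_from_variable(stored)
```

Because the stored text has no \r, \n, tab or 0x00 in it, there is nothing for the load routine to rewrite, and no replace() call is needed on the way back.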

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by

Worstje Netherlands (899 posts) Bio

Date

Reply #36 on Tue 01 May 2007 12:18 PM (UTC)

Message

Sorry, I wrote that code+comment before you pasted the relevant code. :)

And I'm not sure why I need the UTF8 conversion (I merely found this topic after my first attempts to unpickle were in vain) so maybe I can remove it. But right now, I'm just glad it works. And thanks for the adjustment of the help files, although personally I'd think the html-encoding you mentioned would be a cleaner solution.

Posted by Shadowfyr USA (1,791 posts) Bio

Date

Reply #37 on Tue 01 May 2007 06:54 PM (UTC)

Amended on Tue 01 May 2007 06:55 PM (UTC) by Shadowfyr

Message

What are CR and LF on their own? I thought I already explained what those did in the original teletype system where they originated. CR ***only*** returned the carriage, i.e., the print position, to the "start of the line". This was used for doing things like bolding, overstriking, underlining, etc.; since you couldn't have those extra characters on a print head, you had to produce them on the printer/teletype by telling it, "Return me to the first carriage position, then proceed." LF was **only** a line feed. In other words, when you used that, it didn't reset the carriage to the start of the line; it left it where it was, then continued printing from there. You had the same thing with a typewriter. One lever "moved" the page to the next line, i.e. LF. The "Carriage Return" involved physically pushing the carriage back to the original position; otherwise, once you moved down a line you were still typing at the last place the carriage was before you moved to the new line.

When computers and automatics were introduced, someone decided to confuse the issue and make the "Enter" or "Return" key perform "both" actions. Apparently, under Unix, they opted to do that for everything, but not by generating both symbols, but by treating LF (move one line down the page) as though it was "both". This of course fouls up anything treating those characters as their original meanings. DOS and even most simpler text boxes for Windows get it wrong too though. They often show the markers for the LF, if the CR is missing, but do not correctly display the data on a new line. In theory, if either one worked right, then this:

1LF2LF3LF4LF5CRLFEnd

Should display as:

1
 2
  3
   4
    5
End

It doesn't, in *either* system. DOS/Windows will often simply ignore the LF if not paired. Unix/Linux will, in its smarter applications, flag the file internally and treat it as DOS instead of Unix (or convert it by throwing out the redundant info), or it will, in the less smart ones, treat them as two sequential commands to do the same thing, double spacing. Both behaviors are technically wrong, if you follow the original intent of the codes. And as I said, the Unix/Linux version was probably done to save storage space back when 100 MB HDs were something almost unheard of and everything larger was stored on huge banks of tape reels.
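
The strict teletype reading described in this post can be sketched as a tiny renderer, where CR only resets the column and LF only moves down a row. The function name is made up, and printing over an occupied position simply replaces the character, which is a simplification of real overstriking.

```python
def render_teletype(data: str) -> str:
    """Render text with CR and LF treated as separate carriage movements."""
    rows = [[]]
    row = col = 0
    for ch in data:
        if ch == "\r":
            col = 0                    # carriage return: back to column 0, same row
        elif ch == "\n":
            row += 1                   # line feed: down one row, same column
            if row == len(rows):
                rows.append([])
        else:
            line = rows[row]
            while len(line) <= col:    # pad with spaces up to the print position
                line.append(" ")
            line[col] = ch
            col += 1
    return "\n".join("".join(line) for line in rows)


staircase = render_teletype("1\n2\n3\n4\n5\r\nEnd")
print(staircase)
```

Under these rules each bare LF leaves the print position where it was, so the digits step to the right, and only the CR before the last LF brings "End" back to the left margin.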

Posted by

Shadowfyr USA (1,791 posts) Bio

Date

Reply #38 on Tue 01 May 2007 07:02 PM (UTC)

Message

Oh, and just to make things even more absurd... The text modes for both OSes *do* follow the correct usages (usually), for part of it, with LF being one of 4 codes that move the cursor one position up, down, right or left, without printing anything, while CR... still does what electronic typewriters did and performs both actions at the same time (usually). I presume some may treat it as a "return to start of line", but I am not sure.

Posted by

David Haley USA (3,881 posts) Bio

Date

Reply #39 on Tue 01 May 2007 09:04 PM (UTC)

Message

I have the impression we just got a lecture of dubious utility on what the difference between CR and LF is, but I also have the impression you're not responding to a single one of my points. In particular, you repeat your assertion about why you think Unix made the change, without responding to anything I wrote about other reasons they might have chosen.

And a new reason: why is it such a bad thing to have chosen to eliminate useless characters on very storage-limited systems? You speak of it as if it's the greatest of all stupidities.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Posted by

Nick Gammon Australia (23,166 posts) Bio Forum Administrator

Date

Reply #40 on Tue 01 May 2007 09:25 PM (UTC)

Message

I think we are wandering off-topic a bit here. Forget old-fashioned teletype terminals. The issue is how to get pickling to work.

I just want to point out that I think it is reasonable for MUSHclient to change \n on its own to \r\n. One of the reasons for that is that if you try to display stuff to the notepad window (a Windows Edit control), and you don't use \r\n (but just \n) the window looks all screwy.

Now if someone has scripted a trigger or something that displays stuff in the notepad (eg. "send to notepad") and they (for simplicity) just put \n between each line, they will unexpectedly get bad looking output. So ImportXML assumes that line endings need "fixing up". It doesn't assume that someone will want to try to import binary data, and that \r or \n are likely to appear on their own.

Remember, changes like that (and the tab to spaces thing) were put there to make people happy, and generally in response to a complaint of some sort. Sure, it can be removed, but then one person is happy (the pickler) and I get a flood of complaints from other users that the new version has "broken" ImportXML.

- Nick Gammon

www.gammon.com.au, www.mushclient.com

Posted by

Shaun Biggs USA (644 posts) Bio

Date

Reply #41 on Wed 02 May 2007 09:42 AM (UTC)

Message

Quote:
When computers and automatics were introduced, someone decided to confuse the issue and make the "Enter" or "Return" key perform "both" actions.

I honestly hope that you are not suggesting that in order to go to the beginning of a new line one should have to press return and enter (or whatever two keys you want).

Quote:
Apparently, under Unix, they opted to do that for everything, but not by generating both symbols, but by treating LF (move one line down the page) as though it was "both". This of course fouls up anything treating those characters as their original meanings. DOS and even most simpler text boxes for Windows get it wrong too though. They often show the markers for the LF, if the CR is missing, but do not correctly display the data on a new line.

Ok, so *nix systems got it wrong... M$ got it wrong... I'll assume that you think that Mac has it wrong, since they use just a carriage return... Did anyone get it "right?" Keep in mind, that this is all just how various programs interpret the data as a standard.

One more thing about the teletype machines. They used the Baudot code, not ASCII. If you still care to argue about the teletypes, please enjoy yourself. Might also want to address the issue that the delete (127) character originally turned a character into a null (0) code because you could not just remove hole punches. Line feed also just meant that a line's worth of paper should be pushed out of the printer on teletypes, so should that happen when we look at a text file?

All in all, the only valid argument I can see is that since MUSHclient is intended to be run on Windows systems, it should conform to Windows standards. Either that, or have a toggle for people to choose and watch in amusement as various scripts get completely messed up. In C, "\n" is supposed to default to the native newline code, so it should be interpreted as a CR/LF.

It is much easier to fight for one's ideals than to live up to them.

Posted by Shaun Biggs USA (644 posts) Bio

Date Reply #42 on Wed 02 May 2007 09:50 AM (UTC)

Message

Just for fun, I saved three files. Each had "blah blah" twice on separate lines. I saved one as Unix type, one as DOS type, and one as Mac type (LF, CR/LF, CR respectively).


biggs@omni ~ $ cat temp.nix
blah blah
blah blah
biggs@omni ~ $ cat temp.dos
blah blah
blah blah
biggs@omni ~ $ cat temp.mac
biggs@omni ~ $

That's with cat 6.9. Seems to interpret Unix and DOS formats "correctly" while not dealing with Mac's format well.

It is much easier to fight for one's ideals than to live up to them.

Posted by

David Haley USA (3,881 posts) Bio

Date

Reply #43 on Wed 02 May 2007 09:55 AM (UTC)

Message

I have never seen a unixy terminal correctly deal with \r characters. Even the *#*!"@%!&@! terminal on OS X gets confused with \r characters, making it a #"@$"%""@^!@$!@"#" pain to do anything on the command line (e.g. grep) with files written in Mac editors.

And then of course you have the editors/programs that only use \r as the line terminator -- the Lisp interpreter we use is like that. What that means is that if you give it Unix line endings, as soon as it sees a line-comment (like C's //) it considers the rest of the file to be a comment.....

This is a very major problem to me. (In case you hadn't noticed. :-) )

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org

Posted by

Nick Gammon Australia (23,166 posts) Bio Forum Administrator

Date

Reply #44 on Wed 02 May 2007 10:36 AM (UTC)

Amended on Wed 02 May 2007 10:39 AM (UTC) by Nick Gammon

Message

Quote:

Seems to interpret Unix and DOS formats "correctly" while not dealing with Mac's format well.

Well it did what it was told to - moved the cursor to the start of the line twice, where the prompt overwrote it.

I think we are confusing two things here - the physical movement of the cursor (typewriter head in the old technology), and the logical end of a line.

In the olden days, the ASCII codes simply told a piece of mechanical equipment what to do:

carriage return - return the "carriage" (the thing that carried the print head) to the start of the line
line feed - move the paper up one line
backspace - move the carriage back one space
form feed - feed a new form (new piece of paper)
delete - in the case of paper tape, where you had 7 holes punched for the 7 possible bits, "delete" the character by punching all 7 holes (so actually you would backspace over the bad character, and then hit delete to remove it)

With the advent of modern terminals (and PCs and programs that run on them), all of these have morphed into a slightly different meaning:

carriage return - move the cursor back to the start of the line - perhaps. In the case of the Mac operating system, the carriage return character became the "logical end of line" marker.
line feed - start a new line. In the case of the Unix operating system this also became a "logical end of line marker". That is to say, you expect more from your shell (or MUD client) than it simply moving the cursor to the start of the line. You expect it to interpret the previous line as a "command". Thus, you don't normally backspace over a linefeed.

In the case of Windows, I'm not sure, where it uses carriage-return/linefeed combinations, whether the carriage-return, or the linefeed, or the combination together, trigger the "logical end of line" processing.
backspace - delete the character to the left of the cursor (this is more than simply moving the cursor, you notice).
form feed - blank the screen (it can hardly feed the paper on your screen, as it doesn't use paper)
delete - delete the character to the right of the cursor. This is different behaviour to the old paper tape punches, where you actually have the "deleted" character left on the tape.

Thus I think you have to be careful if you are mixing paradigms. Just because the old teletype terminals moved the carriage around is no real reason these days to say that a carriage return "must mean" that the "carriage" "returns".

- Nick Gammon

www.gammon.com.au, www.mushclient.com

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).
