[Home] [Downloads] [Search] [Help/forum]


Register forum user name Search FAQ

Gammon Forum

[Folder]  Entire forum
-> [Folder]  Programming
. -> [Folder]  STL
. . -> [Subject]  Some string utilities

Some string utilities

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page


Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Wed 02 Jul 2003 03:35 AM (UTC)

Amended on Wed 02 Jul 2003 05:55 AM (UTC) by Nick Gammon

Message
This example program demonstrates some useful utilities for strings:


  • Trim trailing spaces
  • Trim leading spaces
  • Trim leading and trailing spaces
  • Convert to lower-case
  • Convert to upper-case
  • Convert to capitalised (eg. Hello World)
  • Extract a word at a delimiter (eg. up to the first space)
  • Split a string into a vector at a delimiter
    (eg. "a, b, c, d" would become a 4-item vector if the delimiter was a comma)
  • Convert a vector of strings back into a single string with a delimiter (converse of above item)





// disable warnings about long names
#ifdef WIN32
  #pragma warning( disable : 4786)
#endif

#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <iterator>
#include <algorithm>
#include <functional>

using namespace std;

#define SPACES " \t\r\n"

inline string trim_right (const string & s, const string & t = SPACES)
  { 
  string d (s); 
  string::size_type i (d.find_last_not_of (t));
  if (i == string::npos)
    return "";
  else
   return d.erase (d.find_last_not_of (t) + 1) ; 
  }  // end of trim_right

inline string trim_left (const string & s, const string & t = SPACES) 
  { 
  string d (s); 
  return d.erase (0, s.find_first_not_of (t)) ; 
  }  // end of trim_left

inline string trim (const string & s, const string & t = SPACES)
  { 
  string d (s); 
  return trim_left (trim_right (d, t), t) ; 
  }  // end of trim

// returns a lower case version of the string 
string tolower (const string & s)
  {
string d (s);

  transform (d.begin (), d.end (), d.begin (), (int(*)(int)) tolower);
  return d;
  }  // end of tolower

// returns an upper case version of the string 
string toupper (const string & s)
  {
string d (s);

  transform (d.begin (), d.end (), d.begin (), (int(*)(int)) toupper);
  return d;
  }   // end of toupper

// transformation function for tocapitals that has a "state"
// so it can capitalise a sequence
class fCapitals : public unary_function<char,char> 
  {
  bool bUpper;

  public:

  // first letter in string will be in capitals
  fCapitals () : bUpper (true) {}; // constructor

  char operator() (const char & c)  
    { 
    char c1;
    // capitalise depending on previous letter
    if (bUpper)
      c1 = toupper (c);
    else
      c1 = tolower (c);

    // work out whether next letter should be capitals
    bUpper = isalnum (c) == 0;
    return c1; 
    }
  };  // end of class fCapitals

// returns a capitalized version of the string 
string tocapitals (const string & s)
  {
string d (s);

  transform (d.begin (), d.end (), d.begin (), fCapitals ());
  return d;
  }  // end of tocapitals


// split a line into the first word, and rest-of-the-line
string GetWord (string & s, 
                const string delim = " ",
                const bool trim_spaces = true)
  {
    
  // find delimiter  
  string::size_type i (s.find (delim));

  // split into before and after delimiter
  string w (s.substr (0, i));

  // if no delimiter, remainder is empty
  if (i == string::npos)
    s.erase ();
  else
    // erase up to the delimiter
    s.erase (0, i + delim.size ());

  // trim spaces if required
  if (trim_spaces)
    {
    w = trim (w);
    s = trim (s);
    }

  // return first word in line
  return w;
  
  } // end of GetWord	

// To be symmetric, we assume an empty string (after trimming spaces)
// will give an empty vector.
// However, a non-empty string (with no delimiter) will give one item
// After that, you get an item per delimiter, plus 1.
// eg.  ""      => empty
//      "a"     => 1 item
//      "a,b"   => 2 items
//      "a,b,"  => 3 items (last one empty)

void StringToVector (const string s, 
                     vector<string> & v,
                     const string delim = " ", 
                     const bool trim_spaces = true)
  {

  // start with initial string, trimmed of leading/trailing spaces if required
  string s1 (trim_spaces ? trim (s) : s);

  v.clear (); // ensure vector empty

  // no string? no elements
  if (s1.empty ())
    return;

  // add to vector while we have a delimiter
  while (!s1.empty () && s1.find (delim) != string::npos)
    v.push_back (GetWord (s1, delim, trim_spaces));

  // add final element
  v.push_back (s1);
  } // end of StringToVector 

// Takes a vector of strings and converts it to a string 
// like "apples,peaches,pears" 
// Should be symmetric with StringToVector (excepting any spaces that might have
//  been trimmed).

string VectorToString (const vector<string> & v, 
                       const string delim = " ")
  {
  // vector empty gives empty string
  if (v.empty ())
    return "";

  // for copying results into
  ostringstream os;

  // copy all but last one, with delimiter after each one
  copy (v.begin (), v.end () - 1, 
        ostream_iterator<string> (os, delim.c_str ()));

  // return string with final element appended
  return os.str () + *(v.end () - 1);

  } // end of VectorToString

int main (void)
  {

  string s ("  happy/days/ are /here/again////  ");                         

  vector<string> v;

  StringToVector (s, v, "/");

  // display item count
  cout << "Converted string into " << v.size () << " items." << endl;

  // convert back into string
  cout << "Vector back to string = (" 
       << VectorToString (v, ", ") << ")" << endl;

  // test other routines

  cout << "original             -->" << s                 << "<--" << endl;
  cout << "trim_right           -->" << trim_right (s)    << "<--" << endl;
  cout << "trim_left            -->" << trim_left (s)     << "<--" << endl;
  cout << "trim                 -->" << trim (s)          << "<--" << endl;
  cout << "tolower              -->" << tolower (s)       << "<--" << endl;
  cout << "toupper              -->" << toupper (s)       << "<--" << endl;
  cout << "tocapitals           -->" << tocapitals (s)    << "<--" << endl;
  cout << "GetWord              -->" << GetWord (s, "/")  << "<--" << endl;
  cout << "After GetWord, s =   -->" << s                 << "<--" << endl;

  return 0;
  } // end of main



Output

Converted string into 9 items.
Vector back to string = (happy, days, are, here, again, , , , )
original             -->  happy/days/ are /here/again////  <--
trim_right           -->  happy/days/ are /here/again////<--
trim_left            -->happy/days/ are /here/again////  <--
trim                 -->happy/days/ are /here/again////<--
tolower              -->  happy/days/ are /here/again////  <--
toupper              -->  HAPPY/DAYS/ ARE /HERE/AGAIN////  <--
tocapitals           -->  Happy/Days/ Are /Here/Again////  <--
GetWord              -->happy<--
After GetWord, s =   -->days/ are /here/again////<--





The code to do the capitals is particularly interesting as that uses a "function object" - an object that remembers its state in order to work out whether to capitalise the next character or not. It has two functions - a constructor that initialises the state variable, and an operator() which is called once for each letter in the string to be capitalised.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Reply #1 on Sun 06 Jul 2003 05:38 AM (UTC)

Amended on Sun 06 Jul 2003 09:51 PM (UTC) by Nick Gammon

Message
More string utilities

Below are a couple more utilities for strings. The case-independent compare that I did for other situations are collected together here. They are presented as binary functions (which you can use as part of things like find_if and so on) and straight functions for direct comparisons.

Also the handy utility strPrefix - that lets you see if a string starts with something. For example you might want to know if a string starts with "#&" rather than being equal to it.

There is also a version of strPrefix that works from an offset from the start of the source string (for when you are working your way through a larger buffer).





#include <string>
#include <iostream>
#include <functional>

using namespace std;  

// case-independent (ci) string compare
// returns true if strings are EQUAL
struct ci_equal_to : binary_function <string, string, bool>
  {

  struct compare_equal 
    : public binary_function <unsigned char, unsigned char,bool>
    {
    bool operator() (const unsigned char& c1, const unsigned char& c2) const
      { return tolower (c1) == tolower (c2); }
    };  // end of compare_equal

  bool operator() (const string & s1, const string & s2) const
    {

    pair <string::const_iterator,
          string::const_iterator> result =
      mismatch (s1.begin (), s1.end (),   // source range
                s2.begin (),              // comparison start
                compare_equal ());  // comparison

    // match if both at end
    return result.first == s1.end () &&
           result.second == s2.end ();

    }
  }; // end of ci_equal_to

// compare strings for equality using the binary function above
// returns true is s1 == s2
bool ciStringEqual (const string & s1, const string & s2)
  {
  return ci_equal_to () (s1, s2);
  }  // end of ciStringEqual

// case-independent (ci) string less_than
// returns true if s1 < s2
struct ci_less : binary_function <string, string, bool>
  {

  // case-independent (ci) compare_less binary function
  struct compare_less 
    : public binary_function <unsigned char, unsigned char,bool>
    {
    bool operator() (const unsigned char& c1, const unsigned char& c2) const
      { return tolower (c1) < tolower (c2); }
    }; // end of compare_less

  bool operator() (const string & s1, const string & s2) const
    {

    return lexicographical_compare
          (s1.begin (), s1.end (),   // source range
           s2.begin (), s2.end (),   // dest range
                  compare_less ());  // comparison
    }
  }; // end of ci_less

// compare strings for less-than using the binary function above
// returns true if s1 < s2
bool ciStringLess (const string & s1, const string & s2)
  {
  return ci_less () (s1, s2);
  }  // end of ciStringLess

// compare to see if start of s1 is s2
//  eg. returns true for: strPrefix ("abacus", "aba");

bool strPrefix (const string & s1,             // string to search    
                const string & s2,             // what to look for 
                const bool no_case = false)    // case-insensitive?   
  {                                               
  
  // if either string is empty or s1 is smaller than s2
  //  then they can't be identical
  if (s1.empty () ||
      s2.empty () ||
      s1.size () < s2.size ())
    return false;

  if (no_case)
    return ciStringEqual (s1.substr (0, s2.size ()), s2);
  else
    return s1.substr (0, s2.size ()) == s2;

  } // end of strPrefix


// compares to see if (s1, offset by pos, for length s2) == s2

bool strPrefix (const string & s1,              // string to search
                const string::size_type pos,    // where in s1 to start
                const string & s2,              // what to look for
                const bool no_case = false)     // case-insensitive?
  {
  // can't be true if position outside string size
  // casts are to ensure a signed comparison 
  if ((int) pos >= ((int) s1.size () - (int) s2.size ()))
    return false;

  // make a substring of s1 for s2's size
  return strPrefix (s1.substr (pos, s2.size ()), s2, no_case);

  } // end of strPrefix


// test
int main (void)
  {

  cout << boolalpha;

  cout << "Should be true ..." << endl << endl;

  // these should be true
  cout << "Nick == nick = "           << ciStringEqual ("Nick", "nick") << endl;
  cout << "Nick >= nick = "           << !ciStringLess ("Nick", "nick") << endl;
  cout << "Nick prefix == Ni = "      << strPrefix ("Nick", "Ni") << endl;
  cout << "Nick ci_prefix == ni = "   << strPrefix ("Nick", "ni", true) << endl;
  cout << "AABBC prefix (2) == BB = " << strPrefix ("AABBC", 2, "BB") << endl;

  cout << endl << "Should be false ..." << endl << endl;

  // these should be false
  cout << "Nick == nicky = "          << ciStringEqual ("Nick", "nicky") << endl;
  cout << "Nick < aaaa = "            << ciStringLess ("Nick", "aaaa") << endl;
  cout << "Nick prefix == ni = "      << strPrefix ("Nick", "ni") << endl;
  cout << "Nack ci prefix == ni = "   << strPrefix ("Nack", "ni", true) << endl;
  cout << "AABBC prefix (3) == BB = " << strPrefix ("AABBC", 3, "BB") << endl;

  return 0;
  } // end of main


Output

Should be true ...

Nick == nick = true
Nick >= nick = true
Nick prefix == Ni = true
Nick ci_prefix == ni = true
AABBC prefix (2) == BB = true

Should be false ...

Nick == nicky = false
Nick < aaaa = false
Nick prefix == ni = false
Nack ci prefix == ni = false
AABBC prefix (3) == BB = false


- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Davion   Canada  (10 posts)  [Biography] bio
Date Reply #2 on Sun 23 Apr 2006 05:13 AM (UTC)
Message
inline string trim_right (const string & s, const string & t = SPACES)
  { 
  string d (s); 
  string::size_type i (d.find_last_not_of (t));
  if (i == string::npos)
    return "";
  else
   return d.erase (d.find_last_not_of (t) + 1) ; 
  }  


I was wondering if there was any specific reason you called find_last_not_of a second time? Can't you just use


return d.erase(i+1);


Btw, I just got through reading this entire board on STL... great stuff!

Davion
[Go to top] top

Posted by David Haley   USA  (3,881 posts)  [Biography] bio
Date Reply #3 on Sun 23 Apr 2006 07:02 AM (UTC)
Message
Your proposed change would work just fine, because after all neither string d or t is being changed. It is possible however that an optimizer would figure it out anyhow, since find_last_not_of is a constant operation, and d and t aren't changing. Still, if you didn't trust the optimizer and wanted to sacrifice readability (which is probably fine in a utility function like this meant to be fast) it would work just as well to just reuse i.

By the way, your website is original, to say the least. :-)

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
[Go to top] top

Posted by Nick Gammon   Australia  (22,973 posts)  [Biography] bio   Forum Administrator
Date Reply #4 on Sun 23 Apr 2006 07:04 AM (UTC)
Message
Thanks!

I can't see any reason not to make your change. I probably copied and pasted the trim_left and found there was a problem with no match, and added the test, hence the duplication.

- Nick Gammon

www.gammon.com.au, www.mushclient.com
[Go to top] top

Posted by Davion   Canada  (10 posts)  [Biography] bio
Date Reply #5 on Sun 23 Apr 2006 02:15 PM (UTC)
Message
Your proposed change would work just fine, because after all neither string d or t is being changed. It is possible however that an optimizer would figure it out anyhow, since find_last_not_of is a constant operation, and d and t aren't changing. Still, if you didn't trust the optimizer and wanted to sacrifice readability (which is probably fine in a utility function like this meant to be fast) it would work just as well to just reuse i.

Err? Optimizers can snarf out redundant function calls? This is news to me! Are you sure about this? I'd rather it didn't, and warn me about the redundant calls so I can make my code cleaner rather than depending on an optimizer. Does it have to do with the constant operation? I've never fully grasped things like constant, inline and static functions/methods. Although, I just picked up Stroustrups book and am mauling my way through it as we speak!

By the way, your website is original, to say the least. :-)

Thanks! :D I put a lot of thought and hard work into it *grin*

Davion
[Go to top] top

Posted by David Haley   USA  (3,881 posts)  [Biography] bio
Date Reply #6 on Sun 23 Apr 2006 07:23 PM (UTC)
Message
Well, an intelligent optimizer can detect all sorts of things and avoid them, and recognizing that you already have the value of an operation is one of them. If you have:
const MyStruct & s;
int i = s.foo(); // where foo is a const method
int j = s.foo();
Then it might not call foo twice, because it knows that it already has the answer (given that s is constant and foo is constant). I suppose this might change in multithreaded environments, but it shouldn't (as opposed to can't).

'const' is one of the most versatile keywords in C++; it can do all sorts of things. What it's doing here is marking an object as constant i.e. immutable; no function can change a constant object. As for a constant method, it's a method that guarantees that it will not change the object either. (As such, constant methods can only pass the object on to other constant methods.)

Inline means to take the body of the function and to paste it wherever you call the function. Consider this:
struct MyStruct
{
    int i;
    int foo() const { return i; }
}

MyStruct s;
s.i = 5312;
cout << s.foo();


When you call s.foo(), a whole lot is happening behind the scenes; you're creating a new stack frame, pushing arguments onto it (in this case, none), then you're jumping to another location in the code, doing whatever you need to do, and finally jumping back and removing the stack frame you created.

But an intelligent optimizer (that's been told it's ok to inline), or one that sees that you wrote 'inline' next to the foo() method, can realize that all that is somewhat silly, and simply replace the "s.foo()" call with s.i (well, the equivalent in assembly, of course). That way, you avoid the whole function call overhead, and you incur the minimum possible cost, which is to simply grab the data member from s.


Generally, the advice given to people is to let the optimizer do its job and to not worry about making code as optimized as possible. Readability is very important for us humans. The places where you would pre-optimize would be in the very speed-sensitive parts of your program. If you look around at code you'll occasionally see somebody write a function or two in assembly. What this means is that they think they have the absolute fastest possible implementation, and have just given it to the compiler instead of trusting the optimizer to figure it out. The problem of course is that if you don't know assembly (and even if you do), it's harder to figure out what the code is actually doing.

Optimizers can do some pretty smart stuff. What we've seen here is really just the tip of the iceberg. Even a relatively stupid (but correct) optimizer can improve the speed of some programs fairly dramatically. Here's an example of something that is very easy to optimize:
void foo()
{
   int i = 5;
   int j = i+9;
   int k = j+14;
   int l = 4+136+123-k;
}

There are several things that can be optimized here. The first one, called constant folding, is to take 4+136+123 and simply calculate the answer: 263. There's no need to do that addition every time you run the program since it always comes out to the same answer. The other you can optimize (whose fancy name is constant propagation, I think) is i+9, j+14 and 263-k. Note that in all those cases, we know what the value of i, j and k is, so again we don't really need to repeat the calculation. Both of these, but especially the first, are pretty easy to optimize and can dramatically improve performance if you call the function very often.

One example is in drawing code, where you might have constant border values. E.g., you might be doing arithmetic with values like LEFTMOST_BORDER, RIGHTMOST_BORDER, CENTER_POINT etc. Some programmers will do their best to fold all these constants and get the final value, but then they greatly sacrifice readability by no longer having the "logic" spelled out. But this is a waste of time, since the optimizer will happily munch up constants and do the math for you.

One thing to try is to take some code that wasn't written with the intent of being super-optimized, and run it through the maximum-strength optimizer. The difference in speed between the two can be amazing. I once had a silly little game program that was running around 50 fps without optimization, and well over 300 fps with optimization. Not too bad, huh? :)

But enough rambling for now... if anybody has actually read this far. :P

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
[Go to top] top

Posted by Davion   Canada  (10 posts)  [Biography] bio
Date Reply #7 on Sun 23 Apr 2006 08:08 PM (UTC)
Message
If someone didn't read that to the end they miss out on valuable information! Thanks for that! It explains much of what I didn't quite grasp.

Is GCC's optimizer smart enough to make a huge difference? If so, to what magnitude is worth it for a MUD?

Davion
[Go to top] top

Posted by David Haley   USA  (3,881 posts)  [Biography] bio
Date Reply #8 on Sun 23 Apr 2006 08:24 PM (UTC)
Message
GCC's optimizer is really quite good. It sometimes makes a program bigger in bytes (due to inlining: instead of a function call, you paste the whole function body!) but often the result is much, much faster.

It's hard to tell how much it'll gain for a MUD without actually running the code and timing it yourself. But I suspect it'll gain quite a lot, relatively speaking. It's just not clear how important it is for a MUD to be super-fast, because you're not doing millions and millions of calculations every second like you'd be doing in a 3d game. Still, the main advantage is that it allows you to write readable code without going nuts on optimizing, which makes it: a) easier to write code; b) easier to read code. Both of these are of the utmost importance. :)

Keep in mind, though, by optimizing the code very heavily, you lose the ability to debug it meaningfully, as the optimizer will move stuff around and change things so the final code will no longer correspond to what you actually wrote. Sometimes an optimizer will completely rearrange the order of function calls or instructions, depending on what is the most speed efficient.

Oh, here's another fun optimization:
int i = 5;
int j = 0;
while ( j++ < 10 ) {
   int k = 5 * i;
   DoSomething(k);
}
Here, beyond constant propagation etc., the optimizer might notice that int k=5*i does not depend on anything in the loop, so it can take that line of code outside the loop and not evaluate it on every iteration. This can make things much faster for long loops, but it also shows what kind of debugging confusion in gdb can occur when the optimizer moves stuff around.

David Haley aka Ksilyan
Head Programmer,
Legends of the Darkstone

http://david.the-haleys.org
[Go to top] top

The dates and times for posts above are shown in Universal Co-ordinated Time (UTC).

To show them in your local time you can join the forum, and then set the 'time correction' field in your profile to the number of hours difference between your location and UTC time.


49,445 views.

It is now over 60 days since the last post. This thread is closed.     [Refresh] Refresh page

Go to topic:           Search the forum


[Go to top] top

Quick links: MUSHclient. MUSHclient help. Forum shortcuts. Posting templates. Lua modules. Lua documentation.

Information and images on this site are licensed under the Creative Commons Attribution 3.0 Australia License unless stated otherwise.

[Home]


Written by Nick Gammon - 5K   profile for Nick Gammon on Stack Exchange, a network of free, community-driven Q&A sites   Marriage equality

Comments to: Gammon Software support
[RH click to get RSS URL] Forum RSS feed ( https://gammon.com.au/rss/forum.xml )

[Best viewed with any browser - 2K]    [Hosted at HostDash]