Author Topic: [Java] URL detection  (Read 7259 times)


Offline Camel

  • Hero Member
  • *****
  • Posts: 1703
    • View Profile
    • BNU Bot
[Java] URL detection
« on: September 22, 2007, 03:19:09 am »
Okay, so I've got this method:

http://bnubot.googlecode.com/svn/trunk/BNUBot/src/net/bnubot/bot/gui/components/TextWindow.java
Code: [Select]
private static Pattern pattern = null;
public String safeHtml(String in) {
    if(pattern == null)
        pattern = Pattern.compile("(.*)(\\b(http://|https://|www.|ftp://|file:/|mailto:)\\S+)(.*)");
    Matcher matcher = pattern.matcher(in);

    if(matcher.matches())
        return safeHtml(matcher.group(1))
            + "<a href=\"" + matcher.group(2) + "\">" + matcher.group(2) + "</a>"
            + safeHtml(matcher.group(4));
    return in
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll("\n", "<br>\n")
        .replaceAll("  ", " &nbsp;");
}

It seems to work alright some of the time, but not all of the time. In particular, it doesn't like to find URLs that aren't the first thing in the input string, and it likes to turn http://www... into http://<a href="...">www...</a>. I'm fairly confident that it's an issue with the regexp, but I'm not exactly sure what's wrong or how to fix it.
« Last Edit: September 22, 2007, 03:22:12 am by Camel »

<Camel> i said what what
<Blaze> in the butt
<Camel> you want to do it in my butt?
<Blaze> in my butt
<Camel> let's do it in the butt
<Blaze> Okay!

Offline Sidoh

  • x86
  • Hero Member
  • *****
  • Posts: 17634
  • MHNATY ~~~~~
    • View Profile
    • sidoh
Re: [Java] URL detection
« Reply #1 on: September 22, 2007, 02:07:14 pm »
That's a pretty liberal regex for a URL, isn't it?  I mean it would match http://hi.hellotherehowareyoutoday (lol, so does SMF) or www.$%%%#@$%@#$@!#$@#$.

The http://<a href="..." problem is happening because you're matching http:// or www. and I don't think \b works in the way you're expecting it to.
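A minimal sketch of that failure, assuming the scheme list from the first post. With matches(), a greedy (.*) prefix grabs as much as it can, so the URL group ends up starting at the last place an alternative fits (the www. right after http://). A reluctant (.*?) prefix stops at the first place instead. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyPrefixDemo {
    // URL part copied from the first post; only the leading group differs below.
    static final String URL_PART =
        "(\\b(http://|https://|www.|ftp://|file:/|mailto:)\\S+)";
    static final Pattern GREEDY = Pattern.compile("(.*)" + URL_PART + "(.*)");
    static final Pattern RELUCTANT = Pattern.compile("(.*?)" + URL_PART + "(.*)");

    // Returns what the URL group (group 2) captured, or null if the input does not match.
    static String urlGroup(Pattern p, String in) {
        Matcher m = p.matcher(in);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String in = "see http://www.google.com now";
        System.out.println(urlGroup(GREEDY, in));    // www.google.com
        System.out.println(urlGroup(RELUCTANT, in)); // http://www.google.com
    }
}
```

In the greedy case group 1 ends up holding "see http://", which is where the stray http:// in front of the anchor tag comes from.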

Since I'm pretty rusty on regex, I tried to make one.  It seems to work pretty well too, but I haven't tested it very extensively.

Maybe this will help ya?

Code: [Select]
public static void main(String[] args)
{
    Pattern url = Pattern.compile(
        "(.*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)"
    );

    String[] tests = {
        "www.google.com/howareyoutoday/",
        "http://www.google.com/asdf",
        "http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new",
        "http://www.google.com/ig?hl=en",
        "wwhatwww.doyouwant.com",
        "www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search"
    };

    for(int i = 0; i < tests.length; i++)
    {
        Matcher match = url.matcher(tests[i]);

        if(match.matches())
        {
            System.out.println("Match");

            System.out.println(""
                + "<a href=\"" + (match.group(3).equals("www.") ? "http://" : "") + match.group(2) + "\">" +
                match.group(2) + "</a>");
        }
        else
        {
            System.out.println("No match");
        }
    }
}

Output

Code: [Select]
Match
<a href="http://www.google.com/howareyoutoday/">www.google.com/howareyoutoday/</a>
Match
<a href="http://www.google.com/asdf">http://www.google.com/asdf</a>
Match
<a href="http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new">http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new</a>
Match
<a href="http://www.google.com/ig?hl=en">http://www.google.com/ig?hl=en</a>
No match
Match
<a href="http://www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search">www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search</a>

Oh, this doesn't work for mailto: things, but I don't think it makes too much sense to look for those anyway.  Wouldn't it make more sense to have another regex that searched for email addresses?  In any case, it could be modified pretty easily to include it.
« Last Edit: September 22, 2007, 03:34:04 pm by Sidoh »

Offline Camel

Re: [Java] URL detection
« Reply #2 on: September 23, 2007, 06:09:35 pm »
Thanks Sidoh, that fixed the http prefix problem.

Code: [Select]
private static Pattern pattern = null;
public String safeHtml(String in) {
    if(pattern == null)
        pattern = Pattern.compile("(.*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)");
    Matcher matcher = pattern.matcher(in);

    if(matcher.matches())
        return safeHtml(matcher.group(1))
            + "<a href=\"" + matcher.group(2) + "\">" + matcher.group(2) + "</a>"
            + safeHtml(matcher.group(10));
    return in
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll("\n", "<br>\n")
        .replaceAll("  ", " &nbsp;");
}

It still isn't picking URLs out of my MOTD string, though. I'll have to do some more debugging, but I think it has something to do with a newline immediately prefixing http://.


Offline Sidoh

Re: [Java] URL detection
« Reply #3 on: September 23, 2007, 06:49:06 pm »
Oh, that's right.  Good call.  I totally forgot "." is everything but newline and I don't think \w (which is referenced by \b) matches linebreaks.  I might be wrong on that, but messing around with things seems to suggest it.

Anyway, this little fix should take care of that:

Code: [Select]
...
"\nwww.google.com/howareyoutoday/",
...
Pattern url = Pattern.compile(
  "((.|\n)*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)"
);
...
System.out.println(""
+ "<a href=\"" + (match.group(4).equals("www.") ? "http://" : "") + match.group(3) + "\">" +
match.group(3) + "</a>");

There might be something else
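For comparison, a small sketch of the newline issue using Pattern.DOTALL, which makes "." match line terminators as well, so the (.|\n) alternation is not needed. The pattern here is deliberately simplified (just www. followed by non-space characters) and the names are hypothetical; it only illustrates the flag, not the full URL regex above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Simplified stand-in for the URL regex: prefix, "www." plus non-space characters, trailer.
    static final String REGEX = "(.*?)(www\\.\\S+)(.*)";
    static final Pattern PLAIN = Pattern.compile(REGEX);
    static final Pattern DOTALL = Pattern.compile(REGEX, Pattern.DOTALL);

    // Returns the URL group, or null when matches() fails on the whole input.
    static String firstUrl(Pattern p, String in) {
        Matcher m = p.matcher(in);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String in = "\nwww.google.com/howareyoutoday/";
        // Without DOTALL, "." cannot consume the leading newline, so matches() fails.
        System.out.println(firstUrl(PLAIN, in));
        // With DOTALL, the prefix group absorbs the newline and the URL is found.
        System.out.println(firstUrl(DOTALL, in));
    }
}
```

One reason to prefer the flag: an alternation repeated inside a group, like (.|\n)*, is reported to run into StackOverflowError on long inputs with Java's regex engine, so it may be worth testing with a large MOTD.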

Offline Camel

Re: [Java] URL detection
« Reply #4 on: September 23, 2007, 07:15:53 pm »
Cool, that works. I did the same thing to the trailer, as well. It still misses trailing slashes, but I don't care since it still works.

Code: [Select]
private static Pattern pattern = null;
public String safeHtml(String in) {
    if(pattern == null)
        pattern = Pattern.compile("((.|\n)*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))((.|\n)*)");
    Matcher matcher = pattern.matcher(in);

    if(matcher.matches())
        return safeHtml(matcher.group(1))
            + "<a href=\"" + matcher.group(3) + "\">" + matcher.group(3) + "</a>"
            + safeHtml(matcher.group(11));
    return in
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll("\n", "<br>\n")
        .replaceAll("  ", " &nbsp;");
}
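
The recursive prefix/URL/trailer split above could also be written as a loop over Matcher.find(), which sidesteps the leading and trailing catch-all groups entirely. This is only a sketch, not the BNU Bot code: the pattern is a simplified assumption (a 3-6 letter scheme or www., then non-space characters), the class name is invented, and escaping the URL itself (including double quotes) is an addition the original does not make.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkifySketch {
    // Assumed, simplified URL pattern; looser than the one discussed in the thread.
    private static final Pattern URL =
        Pattern.compile("\\b(?:[a-zA-Z]{3,6}://|www\\.)\\S+");

    // Same replacements as the original method, plus double quotes for attribute safety.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\n", "<br>\n")
                .replace("  ", " &nbsp;");
    }

    static String safeHtml(String in) {
        StringBuilder out = new StringBuilder();
        Matcher m = URL.matcher(in);
        int last = 0;
        while (m.find()) {
            // Escape the plain text between the previous URL and this one.
            out.append(escape(in.substring(last, m.start())));
            String url = m.group();
            // Bare "www." links get a scheme so the href is absolute.
            String href = url.startsWith("www.") ? "http://" + url : url;
            out.append("<a href=\"").append(escape(href)).append("\">")
               .append(escape(url)).append("</a>");
            last = m.end();
        }
        // Escape whatever follows the last URL (or the whole input if none matched).
        out.append(escape(in.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(safeHtml("MOTD:\nhttp://www.google.com/ig?hl=en and www.google.com"));
    }
}
```

Because find() scans the input once, newlines before a URL need no special handling, and several URLs in one string are handled without recursion.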


Offline Joe

  • B&
  • Moderator
  • Hero Member
  • *****
  • Posts: 10319
  • In Soviet Russia, text read you!
    • View Profile
    • Github
Re: [Java] URL detection
« Reply #5 on: September 24, 2007, 05:59:18 pm »
"wwhatwww.doyouwant.com",

That's totally a valid URL.
I'd personally do as Joe suggests

You might be right about that, Joe.


Offline Sidoh

Re: [Java] URL detection
« Reply #6 on: September 24, 2007, 06:19:23 pm »
"wwhatwww.doyouwant.com",

That's totally a valid URL.

Yes, but the idea is that you ignore strings that aren't almost certainly URLs (meaning they're prefixed with either www. or a protocol://).

If you were building a regular expression to validate URLs, then you'd want to include ones like that, but you'd also want to be much more conservative, explicitly specifying protocols, .extensions, etc.

Offline Camel

Re: [Java] URL detection
« Reply #7 on: September 24, 2007, 07:22:36 pm »
"wwhatwww.doyouwant.com",

That's totally a valid URL.

Just because your browser's address bar knows what you mean, you can't conclude that it's a URL.

That is not, in fact, a URL, because it does not give you enough information to know what service it provides, or how to connect to it.
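
One way to see this point in code, as a rough sketch: java.net.URI will parse a bare host name, but only as a relative reference with no scheme, so there is nothing that says which service to use. The helper name schemeOf is made up for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

class SchemeCheck {
    // Returns the scheme of the given string, or null if it has none or does not parse.
    static String schemeOf(String s) {
        try {
            return new URI(s).getScheme();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // No scheme: parsed as a relative reference, so getScheme() is null.
        System.out.println(schemeOf("wwhatwww.doyouwant.com"));
        // With a scheme, the string says how to connect.
        System.out.println(schemeOf("http://wwhatwww.doyouwant.com"));
    }
}
```

A linkifier that accepts www. prefixes is therefore guessing the scheme (usually http://) on the user's behalf, which is a convenience rather than part of the URL.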
