[Java] URL detection

Started by Camel, September 22, 2007, 03:19:09 AM


Camel

Okay, so I've got this method:

http://bnubot.googlecode.com/svn/trunk/BNUBot/src/net/bnubot/bot/gui/components/TextWindow.java
private static Pattern pattern = null;
public String safeHtml(String in) {
    if(pattern == null)
        pattern = Pattern.compile("(.*)(\\b(http://|https://|www.|ftp://|file:/|mailto:)\\S+)(.*)");
    Matcher matcher = pattern.matcher(in);

    if(matcher.matches())
        return safeHtml(matcher.group(1))
            + "<a href=\"" + matcher.group(2) + "\">" + matcher.group(2) + "</a>"
            + safeHtml(matcher.group(4));
    return in
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll("\n", "<br>\n")
        .replaceAll("  ", " &nbsp;");
}


It seems to work some of the time, but not all of the time. In particular, it fails to find URLs that aren't the first thing in the input string, and it turns http://www... into http://<a href="...">www...</a>. I'm fairly confident that the regexp is at fault, but I'm not sure exactly what's wrong or how to fix it.

<Camel> i said what what
<Blaze> in the butt
<Camel> you want to do it in my butt?
<Blaze> in my butt
<Camel> let's do it in the butt
<Blaze> Okay!

Sidoh

That's a pretty liberal regex for a URL, isn't it?  I mean it would match http://hi.hellotherehowareyoutoday (lol, so does SMF) or www.$%%%#@$%@#$@!#$@#$.

The http://<a href="..." problem is happening because you're matching http:// or www. and I don't think \b works in the way you're expecting it to.
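One way to see the http://<a href="..." behaviour in isolation is the sketch below. It assumes the greedy leading (.*) is a contributing factor: a greedy group grabs as much as it can, so the URL group ends up starting at the last position where a prefix matches. The class and method names (GreedyDemo, capturedUrl) are made up for the demo and are not part of BNUBot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex tail is taken from the original post.
final class GreedyDemo {
    // Everything after the leading group in the original pattern.
    static final String TAIL =
        "(\\b(http://|https://|www.|ftp://|file:/|mailto:)\\S+)(.*)";

    // Returns what the pattern captures as the URL (group 2), or null if no match.
    static String capturedUrl(String leadingGroup, String input) {
        Matcher m = Pattern.compile(leadingGroup + TAIL).matcher(input);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = "http://www.google.com";
        // Greedy: (.*) swallows "http://", so the URL group starts at "www."
        System.out.println(capturedUrl("(.*)", input));   // www.google.com
        // Reluctant: (.*?) takes as little as possible, so the URL starts at "http://"
        System.out.println(capturedUrl("(.*?)", input));  // http://www.google.com
    }
}
```

With the greedy group, "http://" lands in group 1 and gets emitted as plain text in front of the link, which matches the symptom described above.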

Since I'm pretty rusty on regex, I tried to make one.  It seems to work pretty well too, but I haven't tested it very extensively.

Maybe this will help ya?

public static void main(String[] args)
{
    Pattern url = Pattern.compile(
        "(.*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)"
    );

    String[] tests = {
        "www.google.com/howareyoutoday/",
        "http://www.google.com/asdf",
        "http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new",
        "http://www.google.com/ig?hl=en",
        "wwhatwww.doyouwant.com",
        "www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search"
    };

    for(int i = 0; i < tests.length; i++)
    {
        Matcher match = url.matcher(tests[i]);

        if(match.matches())
        {
            System.out.println("Match");

            System.out.println(""
                + "<a href=\"" + (match.group(3).equals("www.") ? "http://" : "") + match.group(2) + "\">" +
                match.group(2) + "</a>");
        }
        else
        {
            System.out.println("No match");
        }
    }
}


Output

Match
<a href="http://www.google.com/howareyoutoday/">www.google.com/howareyoutoday/</a>
Match
<a href="http://www.google.com/asdf">http://www.google.com/asdf</a>
Match
<a href="http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new">http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new</a>
Match
<a href="http://www.google.com/ig?hl=en">http://www.google.com/ig?hl=en</a>
No match
Match
<a href="http://www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search">www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search</a>


Oh, this doesn't work for mailto: things, but I don't think it makes too much sense to look for those anyway.  Wouldn't it make more sense to have another regex that searched for email addresses?  In any case, it could be modified pretty easily to include it.

Camel

Thanks Sidoh, that fixed the http prefix problem.

private static Pattern pattern = null;
public String safeHtml(String in) {
    if(pattern == null)
        pattern = Pattern.compile("(.*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)");
    Matcher matcher = pattern.matcher(in);

    if(matcher.matches())
        return safeHtml(matcher.group(1))
            + "<a href=\"" + matcher.group(2) + "\">" + matcher.group(2) + "</a>"
            + safeHtml(matcher.group(10));
    return in
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll("\n", "<br>\n")
        .replaceAll("  ", " &nbsp;");
}


It still isn't picking URLs out of my MOTD string, though. I'll have to do some more debugging, but I think it has something to do with a newline immediately prefixing http://.


Sidoh

Oh, that's right.  Good call.  I totally forgot "." is everything but newline and I don't think \w (which is referenced by \b) matches linebreaks.  I might be wrong on that, but messing around with things seems to suggest it.
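For what it's worth, java.util.regex also has a Pattern.DOTALL flag that makes "." match line terminators, which is an alternative to spelling out (.|\n). The sketch below uses a deliberately cut-down pattern (just a www. prefix) to show the difference; DotAllDemo and its method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is a simplified stand-in, not the full URL regex.
final class DotAllDemo {
    static final String REGEX = "(.*?)\\bwww\\.\\S+(.*)";

    // Default mode: "." does not match "\n".
    static boolean matchesDefault(String input) {
        return Pattern.compile(REGEX).matcher(input).matches();
    }

    // DOTALL mode: "." matches any character, including "\n".
    static boolean matchesDotAll(String input) {
        return Pattern.compile(REGEX, Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "\nwww.google.com/howareyoutoday/";
        System.out.println(matchesDefault(input)); // false
        System.out.println(matchesDotAll(input));  // true
    }
}
```

A side benefit of the flag is that it adds no capturing group, so the group numbers stay the same as in the single-line version. An alternation inside a repetition such as (.|\n)* can also recurse deeply in Java's regex engine on long inputs, so the flag may be the safer choice for big strings.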

Anyway, this little fix should take care of that:


...
"\nwww.google.com/howareyoutoday/",
...
Pattern url = Pattern.compile(
  "((.|\n)*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)"
);
...
System.out.println(""
+ "<a href=\"" + (match.group(4).equals("www.") ? "http://" : "") + match.group(3) + "\">" +
match.group(3) + "</a>");


There might be something else

Camel

Cool, that works. I did the same thing to the trailer, as well. It still misses trailing slashes, but I don't care since it still works.

private static Pattern pattern = null;
public String safeHtml(String in) {
    if(pattern == null)
        pattern = Pattern.compile("((.|\n)*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))((.|\n)*)");
    Matcher matcher = pattern.matcher(in);

    if(matcher.matches())
        return safeHtml(matcher.group(1))
            + "<a href=\"" + matcher.group(3) + "\">" + matcher.group(3) + "</a>"
            + safeHtml(matcher.group(11));
    return in
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll("\n", "<br>\n")
        .replaceAll("  ", " &nbsp;");
}
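A different way to structure the same idea, sketched here as an alternative and not as what BNUBot does, is to scan with Matcher.find() in a loop instead of recursing on matches(). The pattern then only has to describe the URL itself, so the leading and trailing catch-all groups (and the newline handling they need) go away. The URL pattern below is a simplified assumption, and LinkifySketch, escape and the extra &quot; escaping are additions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; names and the simplified URL pattern are assumptions.
final class LinkifySketch {
    // A protocol:// or www. prefix at a word boundary, then any run of non-whitespace.
    private static final Pattern URL =
        Pattern.compile("\\b(?:[a-zA-Z]{3,6}://|www\\.)\\S+");

    // Same escapes as the original method, plus &quot; so a quote cannot end the href.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("\n", "<br>\n")
                .replace("  ", " &nbsp;");
    }

    static String safeHtml(String in) {
        StringBuilder out = new StringBuilder();
        Matcher m = URL.matcher(in);
        int last = 0; // end of the previous URL, i.e. start of pending plain text
        while (m.find()) {
            // Escape the plain text between the previous URL and this one.
            out.append(escape(in.substring(last, m.start())));
            String url = m.group();
            // Bare www. links need a scheme to be usable as an href.
            String href = url.startsWith("www.") ? "http://" + url : url;
            out.append("<a href=\"").append(escape(href)).append("\">")
               .append(escape(url)).append("</a>");
            last = m.end();
        }
        // Escape whatever follows the last URL (or the whole input if none matched).
        out.append(escape(in.substring(last)));
        return out.toString();
    }
}
```

Because find() walks the input left to right, each URL is handled exactly once and text between URLs is escaped exactly once. It also escapes & inside the link itself, which the recursive version passes through unescaped.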


Joe

"wwhatwww.doyouwant.com",

That's totally a valid URL.
Quote from: Camel on June 09, 2009, 04:12:23 PM
I'd personally do as Joe suggests

Quote from: AntiVirus on October 19, 2010, 02:36:52 PM
You might be right about that, Joe.


Sidoh

Quote from: Joe
"wwhatwww.doyouwant.com",

That's totally a valid URL.

Yes, but the idea is that you ignore strings that aren't almost certainly URLs (meaning they're prefixed with one of www. or a protocol://).

If you were building a regular expression to validate URLs, then you'd want to include ones like that, but you'd also want to be much more conservative, explicitly specifying protocols, .extensions, etc.

Camel

Quote from: Joe
"wwhatwww.doyouwant.com",

That's totally a valid URL.

Just because your browser's address bar knows what you mean, you can't conclude that it's a URL.

That is not, in fact, a URL, because it does not give you enough information to know what service it provides, or how to connect to it.
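The "not enough information" point can be checked with java.net.URI, which parses a string without a scheme as a relative reference. A minimal sketch, with SchemeCheck and schemeOf being names invented for the example:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; shows that a bare host name carries no scheme.
final class SchemeCheck {
    // Returns the scheme ("http", "ftp", ...) or null if there is none
    // or the string is not a syntactically valid URI.
    static String schemeOf(String s) {
        try {
            return new URI(s).getScheme();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(schemeOf("wwhatwww.doyouwant.com"));     // null
        System.out.println(schemeOf("http://www.google.com/asdf")); // http
    }
}
```

Note that "www.google.com" also comes back with a null scheme, which is why the linkifying code earlier in the thread prepends http:// for the www. case.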
