I have a forum that's getting hit a lot by forum spambots, and of course the best way to defeat something is to know thy enemy. I'll worry about defeating those spambots later, but right now I'd like to know more about them. Reading around, I was surprised by the lack of thorough information on the subject (or perhaps I just failed to pick the right search terms for better Google results).
I'm interested in learning all about spambots. I've asked on other forums and gotten brush-off answers like "Spambots are always users registering on your site."
How do forum spambots work?
Talented (if evil) programmers write them - there are probably as many different types of spambots as there are people writing them but, unfortunately, it only takes a few spambot authors sharing and selling their work to ruin life for administrators...
One popular forum spamming application is called "xrumer" - here's a video of xrumer in action (note that there is a fair amount of human intervention required to get it set up).
While I realize that this doesn't answer all of your questions, I think it bears mentioning that anything a bot can't do well (like solve complex non-static logic questions) can be done by a low-paid worker overseas. Spamming is a business much like any other and there is no shortage of cheap labor being plied toward putting spam messages out there.
How do they find the 'new user registration' page? (I'm especially surprised because some forums don't have a dedicated URL for this, e.g. www.forum.com/register.html, but instead use query strings or even other methods invisible in the URL bar.)
They find new sites by:
How do they know what to enter into each 'new user registration' field?
They know what to enter into each field by using the field names as a guide. 99.99% of the time the email address field is named "email" or something containing the word "email". You don't have to be a rocket scientist to know that field probably is for an email address. For things like names, login ID, addresses etc. it works on the same principle.
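To make the "field names as a guide" idea concrete, here is a rough Python sketch of that kind of name matching. The keyword table and the function name guess_field_kind are made up for illustration; I don't know what any real tool actually uses, only that the principle is this simple.

```python
import re

# Hypothetical keyword table (an assumption for illustration only).
# Order matters: "user_email" should be recognised as an email field
# before the more general "user" pattern gets a chance to match.
FIELD_PATTERNS = [
    ("email", re.compile(r"e-?mail", re.IGNORECASE)),
    ("password", re.compile(r"pass|pwd", re.IGNORECASE)),
    ("username", re.compile(r"user|login|nick", re.IGNORECASE)),
    ("url", re.compile(r"url|website|homepage", re.IGNORECASE)),
]


def guess_field_kind(field_name):
    """Guess what a form field is for from its name alone."""
    for kind, pattern in FIELD_PATTERNS:
        if pattern.search(field_name):
            return kind
    return "unknown"


if __name__ == "__main__":
    for name in ("user_email", "login_id", "website", "captcha_answer"):
        print(name, "->", guess_field_kind(name))
```

The useful takeaway for a forum owner is the last case: a field whose name gives nothing away, or a decoy field with a tempting name like "website", is exactly where this kind of guessing goes wrong.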
How do they determine what's a page they can spam / enter data into and what is not?
They don't care. The automated tools can try so many forms in such a short period of time, at virtually no cost, that trying every form possible is a no-brainer. When human labor is involved they can be "script kiddies" who try the obvious stuff to see if they get any kind of response that indicates the form is potentially vulnerable. Basically, any form is a potential target to them, as is any page that accepts user input.
How do forum spambots work?
Do they even 'view' this page at all? If not, then I'd assume they're communicating with the server directly. How is this possible? How do they do it?
Where do spambots come from? Is someone sitting behind the computer snickering as they watch their bot destroy site after site? Or are they snickering as they simply 'release' it onto the internet somehow? Are spambots 'run' by an infected computer somewhere? Do they replicate themselves?
It's all automated. Tools like xrumer are built and sold, and contain the ability to exploit software with known vulnerabilities. Anyone can buy one, and after setting it up it's more or less fire and forget. It goes to every forum in its list and tries to spam it to the best of its ability. Through sheer brute force it succeeds often enough to be worth it for the spammers. That's why they never stop. They barely have to lift a finger for it to work.
Can forum spambots break CAPTCHAs? Can they solve logic questions (how?)? Math questions?
Yes, but not always. It depends on how well the CAPTCHA is implemented. Many CAPTCHAs, including those offered by big companies, have been beaten and are effectively useless. That's why multiple forms of protection are required to stop them. Even then, humans can usually beat any system.
What techniques are still valid to prevent them?
From a previous answer: You could do several things (and should be doing more than one) including:
1) Putting a fake field in the form that only bots will see. Then if that field is submitted filled in with the rest of the form you can ignore the submission (and ban the sender if desired). You can also trap bad bots that follow a hidden link.
2) Use a CAPTCHA like reCAPTCHA
3) Use a field that requires the user to answer a question like what is 5 + 3. Any human can answer it but a bot won't know what to do since it is auto-populating fields based on field names. So that field will be either incorrect or missing in which case the submission will be rejected.
4) Use a token and put it into a session and also add it to the form. If the token is not submitted with the form or doesn't match, then the submission is automated and can be ignored.
5) Look for repeated submissions from the same IP address. If your form shouldn't get many requests but suddenly does, it is probably being hit by a bot and you should consider temporarily blocking the IP address.
6) Use Akismet. It is great at identifying spam.
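Points 1, 3, 4 and 5 from the list can be combined in one server-side check. The following is only a sketch in framework-free Python: the form and session are plain dicts, and the field names ("website", "token", "answer"), the limits, and the function names are all assumptions I picked for the example, not part of any particular forum software.

```python
import secrets
import time
from collections import defaultdict, deque
from hmac import compare_digest

HONEYPOT_FIELD = "website"  # hypothetical decoy field, hidden from humans via CSS
MAX_POSTS = 5               # assumed limit: submissions allowed per window
WINDOW_SECONDS = 60.0       # assumed window length

_recent = defaultdict(deque)  # IP address -> timestamps of recent submissions


def new_token(session):
    """Create a random token, store it in the session, return it for the form."""
    token = secrets.token_urlsafe(32)
    session["form_token"] = token
    return token


def too_many_requests(ip, now=None):
    """Sliding-window counter: True once an IP exceeds MAX_POSTS per window."""
    now = time.monotonic() if now is None else now
    stamps = _recent[ip]
    while stamps and now - stamps[0] > WINDOW_SECONDS:
        stamps.popleft()  # forget submissions that fell out of the window
    stamps.append(now)
    return len(stamps) > MAX_POSTS


def check_submission(form, session, ip, now=None):
    """Return a rejection reason, or None if the submission passes every check."""
    if too_many_requests(ip, now):
        return "rate limited"                      # point 5
    if form.get(HONEYPOT_FIELD):
        return "honeypot filled"                   # point 1
    expected = session.pop("form_token", None)     # tokens are single use
    submitted = form.get("token", "")
    if expected is None or not compare_digest(expected.encode(), submitted.encode()):
        return "bad token"                         # point 4
    if form.get("answer", "").strip() != "8":      # the "what is 5 + 3" question
        return "wrong answer"                      # point 3
    return None
```

In use, new_token(session) would be called when the form is rendered and its value placed in a hidden "token" input; check_submission then runs on the POST. A real deployment would want the rate-limit state somewhere shared rather than in a module-level dict.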
I made an Anti-spam plugin for WordPress; it blocks spam pretty well without a CAPTCHA or anything else.
You can download the plugin and use its code to solve the spam problem on your site.
When trying to defeat them, one thing I'd keep in mind is that their purpose is usually to post links to as many websites as possible for the black-hat SEO benefit.
They care about the number of sites they gain access to, and not your site specifically. Someone wanting to spam only your site could simply sign up without using a robot.
As such, I'm pretty sure that a well-written bespoke test (e.g. questions your forum members will know the answer to) is almost always going to be more effective against robots than any pre-written one which robots are likely to be wise to.
For example, if a robot cracked reCAPTCHA then it would have access to millions of forms to spam. If it cracked a bespoke test, then it would only have access to one website, so no automated spambot is going to bother doing that.
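A bespoke test doesn't need much code. Here is a minimal Python sketch; the questions, answers and function names are invented placeholders, and you'd replace them with things only your own community would know.

```python
import random

# Hypothetical site-specific questions (placeholders, not real data).
# Each entry maps a question id to (question text, set of accepted answers).
QUESTIONS = {
    "mascot": ("What animal is in this forum's logo?", {"owl", "an owl"}),
    "founder": ("What is the admin's username?", {"admin_jane"}),
    "topic": ("What hobby is this forum about?", {"kayaking", "kayak"}),
}


def pick_question(rng=random):
    """Pick a random question; returns (question_id, question_text)."""
    qid = rng.choice(sorted(QUESTIONS))
    return qid, QUESTIONS[qid][0]


def is_correct(qid, answer):
    """Check an answer, ignoring case and surrounding whitespace."""
    entry = QUESTIONS.get(qid)
    if entry is None:
        return False  # unknown or tampered question id
    return answer.strip().casefold() in entry[1]
```

The question id would be kept in the session (or signed) rather than trusted from the form, so a bot can't simply pick the one question it has an answer for.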
https://www.projecthoneypot.org may provide some good data to use (e.g. keywords and IPs to block)
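Once you have such data, applying it is straightforward. This Python sketch assumes you have already exported a list of networks and keywords yourself; the values below are placeholders (the IP ranges are reserved documentation addresses), and nothing here talks to Project Honey Pot or uses its actual data format.

```python
import ipaddress

# Placeholder data (assumptions): in practice these would be loaded from
# whatever blocklist you maintain or export.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.17/32"),
]
BLOCKED_KEYWORDS = ("cheap pills", "casino bonus")


def is_blocked_ip(ip_text):
    """True if the address falls inside any blocked network."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # not a parseable address; let other checks handle it
    return any(ip in network for network in BLOCKED_NETWORKS)


def has_blocked_keyword(text):
    """True if the post text contains any blocked phrase (case-insensitive)."""
    lowered = text.casefold()
    return any(keyword in lowered for keyword in BLOCKED_KEYWORDS)
```

Keyword matching this crude will produce false positives, so it's probably better used to hold posts for moderation than to reject them outright.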