ReCAPTCHA


reCAPTCHA is a provider of human verification systems owned by Google.

The original iteration of the service was a mass collaboration platform designed for the digitization of books, particularly those that were too illegible to be scanned by computers. The verification prompts utilized pairs of words from scanned pages, with one known word used as a control for verification, and the second used to crowdsource the reading of an uncertain word.[1] reCAPTCHA was originally developed by Luis von Ahn, David Abraham, Manuel Blum, Michael Crawford, Ben Maurer, Colin McMillen, and Edison Tan at Carnegie Mellon University's main Pittsburgh campus.[2] It was acquired by Google in September 2009.[3] The system helped to digitize the archives of The New York Times, and was subsequently used by Google Books for similar purposes.[4]

The system was reported as displaying over 100 million CAPTCHAs every day,[5] on sites such as Facebook, TicketMaster, Twitter, 4chan, CNN.com, StumbleUpon,[6] Craigslist (since June 2008),[7] and the U.S. National Telecommunications and Information Administration's digital TV converter box coupon program website (as part of the US DTV transition).[8]

In 2014, Google pivoted the service away from its original concept, with a focus on reducing the amount of user interaction needed to verify a user, and only presenting human recognition challenges (such as identifying images in a set that satisfy a specific prompt) if behavioral analysis suspects that the user may be a bot. reCAPTCHA v1 was declared end-of-life on March 31, 2018.

Origin

Distributed Proofreaders was the first project to volunteer its time to decipher scanned text that could not be read by optical character recognition (OCR) programs. It works with Project Gutenberg to digitize public domain material and uses methods quite different from reCAPTCHA.

The reCAPTCHA program originated with Guatemalan computer scientist Luis von Ahn,[9] and was aided by a MacArthur Fellowship. An early CAPTCHA developer, he realized "he had unwittingly created a system that was frittering away, in ten-second increments, millions of hours of a most precious resource: human brain cycles".[10][11]

Operation

File:Modern-captcha.jpg
An example of how a reCAPTCHA challenge looked in 2007,[12] containing the words "following finding". The waviness and horizontal stroke were added to increase the difficulty of breaking the CAPTCHA with a computer program.

Scanned text is analyzed by two different OCR programs. Any word that the two programs decipher differently, or that is not in an English dictionary, is marked as "suspicious" and converted into a CAPTCHA. The suspicious word is displayed out of context, sometimes along with a control word that is already known. If the human types the control word correctly, the response to the questionable word is accepted as probably valid. If enough users typed the control word correctly but typed the suspicious word incorrectly, the digital version of the document could end up containing the incorrect word.

The identification performed by each OCR program is given a value of 0.5 points, and each interpretation by a human is given a full point. Once a given identification reaches 2.5 points, the word is considered valid. Words that are consistently given a single identity by human judges are later recycled as control words.[13] If the first three guesses match each other but do not match either OCR reading, they are considered a correct answer, and the word becomes a control word.[14] When six users reject a word before any correct spelling is chosen, the word is discarded as unreadable.[14]
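The point scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code reCAPTCHA actually ran: the class name `SuspiciousWord`, the method `record_human`, and the convention that `None` means "the user rejected the word" are all hypothetical. Only the weights (0.5 per OCR reading, 1 per human reading), the 2.5-point threshold, and the six-rejection limit come from the text.

```python
from collections import defaultdict

OCR_WEIGHT = 0.5        # each OCR reading counts half a point
HUMAN_WEIGHT = 1.0      # each human reading counts a full point
ACCEPT_THRESHOLD = 2.5  # a reading is accepted once it reaches this score
MAX_REJECTIONS = 6      # rejections after which the word is discarded


class SuspiciousWord:
    """Tracks the votes for one word the OCR programs could not agree on."""

    def __init__(self, ocr_readings):
        self.scores = defaultdict(float)
        for reading in ocr_readings:
            self.scores[reading] += OCR_WEIGHT
        self.rejections = 0
        self.status = "pending"   # "pending", "accepted" or "unreadable"
        self.accepted = None

    def record_human(self, answer, control_ok):
        """Record one response; answer=None means the user rejected the word."""
        # Responses only count if the control word was typed correctly.
        if self.status != "pending" or not control_ok:
            return self.status
        if answer is None:
            self.rejections += 1
            if self.rejections >= MAX_REJECTIONS:
                self.status = "unreadable"
            return self.status
        self.scores[answer] += HUMAN_WEIGHT
        if self.scores[answer] >= ACCEPT_THRESHOLD:
            self.status = "accepted"
            self.accepted = answer
        return self.status
```

Under these assumptions, a reading proposed by one OCR program needs two matching human answers (0.5 + 1 + 1 = 2.5), while a reading proposed by neither program needs three, which is consistent with the "first three guesses" rule above.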

The original reCAPTCHA method was designed to show the questionable words separately, as out-of-context correction, rather than in use, such as within a phrase of five words from the original document.[15] Also, the control word might supply misleading context for the second word: a request of "/metal/ /fife/" could be entered as "metal file", because the logical connection between filing and a metal tool is more common than the musical instrument "fife".[citation needed]

In 2012, reCAPTCHA began using photographs taken from the Google Street View project, in addition to scanned words.[16]

File:Images Recaptcha.png
Image identification CAPTCHA


No CAPTCHA reCAPTCHA

File:NoCAPTCHA reCAPTCHA.gif
The NoCAPTCHA reCAPTCHA

In 2013, reCAPTCHA began implementing behavioral analysis of the browser's interactions to predict whether the user was a human or a bot. The following year, Google began to deploy a new reCAPTCHA API, featuring the "no CAPTCHA reCAPTCHA" — where users deemed to be of low risk only need to click a single checkbox to verify their identity. A CAPTCHA may still be presented if the system is uncertain of the user's risk; Google also introduced a new type of CAPTCHA challenge designed to be more accessible to mobile users, where the user must select images matching a specific prompt from a grid.[17][18]

In 2017, Google introduced a new "invisible" reCAPTCHA, where verification occurs in the background, and no challenges are displayed at all if the user is deemed to be of low risk.[19][20][21] According to former Google "click fraud czar" Shuman Ghosemajumder, this capability "creates a new sort of challenge that very advanced bots can still get around, but introduces a lot less friction to the legitimate human."[21]

reCAPTCHA v1 was declared end-of-life and shut down on March 31, 2018.[22]

Implementation

The reCAPTCHA tests are displayed from the central site of the reCAPTCHA project, which supplies the words to be deciphered. This is done through a JavaScript API with the server making a callback to reCAPTCHA after the request has been submitted. The reCAPTCHA project provides libraries for various programming languages and applications to make this process easier. reCAPTCHA is a free-of-charge service provided to websites for assistance with the decipherment,[23] but the reCAPTCHA software is not open-source.[24]
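The server-side callback mentioned above can be sketched as follows. This is a minimal illustration, not the official client library: it assumes the `siteverify` endpoint and the `secret`, `response`, and `remoteip` parameters of the later (v2/v3) reCAPTCHA API rather than the discontinued v1 flow, and the function names `build_verify_request`, `is_human`, and `verify` are hypothetical. The secret key and token passed in are placeholders supplied by the site operator and the user's browser respectively.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the v2/v3 verification API.
VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify"


def build_verify_request(secret, token, remote_ip=None):
    """Build the POST request that asks reCAPTCHA to check a user's token."""
    fields = {"secret": secret, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip  # optional: the end user's IP address
    data = urllib.parse.urlencode(fields).encode("ascii")
    return urllib.request.Request(VERIFY_URL, data=data, method="POST")


def is_human(response_body):
    """Interpret the JSON reply; only an explicit success counts as a pass."""
    result = json.loads(response_body)
    return bool(result.get("success"))


def verify(secret, token, remote_ip=None, timeout=5):
    """Perform the callback over the network and return True on success."""
    request = build_verify_request(secret, token, remote_ip)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return is_human(reply.read().decode("utf-8"))
```

Splitting request construction and reply parsing from the network call keeps the two pure functions easy to test without contacting the service.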

Also, reCAPTCHA offers plugins for several web-application platforms including ASP.NET, Ruby, and PHP, to ease the implementation of the service.[25]

Security

File:Recaptcha.png
An example of how reCAPTCHA challenges were presented in 2010,[26] containing the words "and chisels"

The main purpose of a CAPTCHA system is to block spambots while allowing human users. On December 14, 2009, Jonathan Wilkins released a paper describing weaknesses in reCAPTCHA that allowed bots to achieve a solve rate of 18%.[27][28][29]

On August 1, 2010, Chad Houck gave a presentation at the DEF CON 18 hacking conference detailing a method to reverse the distortion added to images, which allowed a computer program to determine a valid response 10% of the time.[30][31] The reCAPTCHA system was modified on July 21, 2010, before Houck was to speak on his method. Houck adapted his method to the modified system, which he described as an "easier" CAPTCHA, and determined a valid response 31.8% of the time. Houck also mentioned security defenses in the system, including a high-security lockout if an invalid response is given 32 times in a row.[32]

On May 26, 2012, Adam, C-P and Jeffball of DC949 gave a presentation at the LayerOne hacker conference detailing how they were able to achieve an automated solution with an accuracy rate of 99.1%.[33] Their tactic was to use techniques from machine learning, a subfield of artificial intelligence, to analyse the audio version of reCAPTCHA, which is available for the visually impaired. Google released a new version of reCAPTCHA just hours before their talk, making major changes to both the audio and visual versions of the service. In this release, the audio version was lengthened from 8 seconds to 30 seconds and became much more difficult to understand, for both humans and bots. In response to this update and the following one, the members of DC949 released two more versions of their solver, Stiltwalker, which beat reCAPTCHA with accuracies of 60.95% and 59.4% respectively. After each successive break, Google updated reCAPTCHA within a few days. According to DC949, Google often reverted to features that had been previously hacked.

On June 27, 2012, Claudia Cruz, Fernando Uceda, and Leobardo Reyes published a paper showing a system running on reCAPTCHA images with an accuracy of 82%.[34] The authors have not said whether their system can solve recent reCAPTCHA images, although they claim their work to be intelligent OCR and robust to some, if not all, changes in the image database.

In an August 2012 presentation given at BsidesLV 2012, DC949 called the latest version "unfathomably impossible for humans" – they were not able to solve them manually either.[33] The web accessibility organization WebAIM reported in May 2012, "Over 90% of respondents [screen reader users] find CAPTCHA to be very or somewhat difficult."[35]

Criticism

The original iteration of reCAPTCHA was criticized as being a source of unpaid work to assist in transcribing efforts.[36]

The current iteration of the system has been criticized for its reliance on tracking cookies and promotion of vendor lock-in with Google services; administrators are encouraged to include reCAPTCHA tracking code in all pages of their website to analyze the behavior and "risk" of users, which determines the level of friction presented when a reCAPTCHA prompt is used. Google stated in its privacy policy that user data collected in this manner is not used for personalized advertising. It was also discovered that the system favors those who have an active Google account login, and assigns a higher risk to those using anonymizing proxies and VPN services.[19]

When Google announced reCAPTCHA v3.0, some people raised privacy concerns: visitors to sites using reCAPTCHA v2.0 already faced the possibility of Google tracking them throughout the website, and the new version appeared to give Google full control over that tracking.

Derivative projects

reCAPTCHA also created the Mailhide project, which protected email addresses on web pages from being harvested by spammers.[37] By default, the email address was converted into a format that did not allow a crawler to see the full email address; for example, "mailme@example.com" would have been converted to "mai...@example.com". The visitor would then click on the "..." and solve the CAPTCHA in order to obtain the full email address. One could also edit the pop-up code so that none of the address was visible. Mailhide was discontinued in 2018 because it relied on reCAPTCHA v1.[38]
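The masking step in the example above can be sketched as below. The function name `mask_email` and the rule of keeping the first three characters of the local part are assumptions inferred from the single example in the text, not Mailhide's documented behavior.

```python
def mask_email(address, visible=3):
    """Hide most of the local part of an email address from crawlers."""
    # Split at the last "@" so the domain stays intact.
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("not an email address: %r" % address)
    # Keep only the first few characters; the "..." stands for the part
    # the visitor reveals by solving a CAPTCHA.
    return local[:visible] + "...@" + domain


print(mask_email("mailme@example.com"))  # mai...@example.com
```

Passing `visible=0` corresponds to the variant in which none of the address is shown.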

Automated solvers

In response to the difficulty for users with disabilities and regular users alike, automated solvers such as Buster have been created, which solve the reCAPTCHA for the user, without them having to complete a challenge. Buster uses the audio part of reCAPTCHA and solves that instead of selecting visual elements, and can be installed as a browser add-on.

References

  1. Template:Citation
  2. Template:Cite web
  3. Cite error: invalid <ref> tag; no text was provided for refs named AutoK4-1
  4. Template:Cite news
  5. Cite error: invalid <ref> tag; no text was provided for refs named AutoK4-2
  6. Cite error: invalid <ref> tag; no text was provided for refs named BBCreport
  7. Cite error: invalid <ref> tag; no text was provided for refs named craig
  8. Cite error: invalid <ref> tag; no text was provided for refs named AutoK4-5
  9. Cite error: invalid <ref> tag; no text was provided for refs named CBC2
  10. Cite error: invalid <ref> tag; no text was provided for refs named AutoK4-6
  11. Cite error: invalid <ref> tag; no text was provided for refs named AutoK4-9
  12. Template:Cite web
  13. Cite error: invalid <ref> tag; no text was provided for refs named AutoK4-8
  14. Template:Cite journal
  15. Cite error: invalid <ref> tag; no text was provided for refs named DM
  16. Template:Cite web
  17. Template:Cite web
  18. Template:Cite magazine
  19. Template:Cite web
  20. Template:Cite web
  21. Template:Cite magazine
  22. Template:Cite web
  23. Cite error: invalid <ref> tag; no text was provided for refs named FAQ
  24. Template:Cite web
  25. Template:Cite web
  26. Template:Cite news
  27. Cite error: invalid <ref> tag; no text was provided for refs named Strong_CAPTCHA_Guidelines
  28. Cite error: invalid <ref> tag; no text was provided for refs named Register_Article
  29. Template:Cite web
  30. Cite error: invalid <ref> tag; no text was provided for refs named Speaker_Program
  31. Cite error: invalid <ref> tag; no text was provided for refs named Decoding_reCAPTCHA
  32. Cite error: invalid <ref> tag; no text was provided for refs named Decoding_reCAPTCHA_pptx
  33. Cite error: invalid <ref> tag; no text was provided for refs named Project_Stiltwalker
  34. Template:Cite book
  35. Template:Cite web
  36. Template:Cite web
  37. Cite error: invalid <ref> tag; no text was provided for refs named Mailhide
  38. Cite error: invalid <ref> tag; no text was provided for refs named MailhideDiscontinued

External links

  • Repository
  • "ReCAPTCHA: The job you didn't even know you had". Two-page article in The Walrus magazine.
  • von Ahn, Luis; Maurer, Benjamin; McMillen, Colin; Abraham, David; Blum, Manuel (2008). "reCAPTCHA: Human-Based Character Recognition via Web Security Measures". Science. 321 (5895): 1465–1468. doi:10.1126/science.1160379. PMID 18703711. CiteSeerX 10.1.1.141.6563.
  • Luis von Ahn: "Massive-scale online collaboration" (TED talk).