Thursday, September 1, 2011

Cloud vs. Crowd: Software and Humans in Plagiarism Detection (Copy Hunters)

Sensational plagiarism cases have recently alarmed the academic community. Can such work not be checked for originality using software? At a Berlin university, a team of scientists has been studying plagiarism detection systems for several years, and has now set the five best ones on the Guttenberg thesis.

Not only universities, but also publishers and other businesses that release publications to the public would be happy to have a kind of silver bullet that can tell in an instant whether a digitally available work is plagiarized or not. At universities, some submissions would then not even need to be read by the teachers; they could be reported to the examination board right away.

Guttenberg's doctoral supervisor has claimed that such programs were not that advanced at the time. But the makers of the plagiarism detection system Turnitin disagreed: "Mr. Guttenberg's work could have been run through our program. That would have spared him much trouble." Shameless self-praise?

So-called plagiarism detection systems have been tested at the Berlin university since 2004. Over several years, many test cases were developed in different languages, covering different types of plagiarism as well as original works, in order to put these systems through their paces. The manufacturers have to be willing to make their systems available free of charge for testing, which, because of poor ratings in the past, was sometimes not the case.

The most recent test appeared in January 2011. The 26 solutions examined were divided into three classes according to their performance: "partially useful", "of limited use" and "useless". Even the best systems find no more than 60 to 70 percent of the plagiarized portions. A number of other systems are hardly usable. In the category of useless systems, we even identified some fraudsters who only want to collect documents.

Naturally, the systems can compare the text under review only with content that is available digitally.
They work over the Internet: one uploads the document to the system through a web form or sends it by e-mail and, after a more or less long delay, receives a result that points to similar or identical text on the web. Some systems attempt to check whether these passages are marked as quotations in the document being analyzed, but that rarely works in practice.

Swarm intelligence in the "crowd"

A frequently voiced criticism of the test method was the short length of the documents examined, 389 to 1055 words. In February 2011 we decided to test the dissertation of Karl-Theodor zu Guttenberg. This gave us the opportunity to run a 475-page book, whose plagiarized passages had already been broken down in detail by "crowdsourcing" (the cooperation of many people over the Internet), through the five "partially useful" systems. The question was whether the "crowd" beats the software operating in the cloud, or vice versa.

In February 2011 the Bremen law professor Andreas Fischer-Lescano voiced a suspicion of plagiarism regarding the thesis of Karl-Theodor zu Guttenberg, and the latter called the accusation "absurd". A small group of interested people, including doctoral students and postdocs, then came together in a Wikia wiki. Wikia is an advertising-funded, open collaboration platform that offers free wikis with a very simple WYSIWYG editor. Because plagiarists rarely plagiarize in only one place, the GuttenPlagWiki was founded, and we began to dissect the work.

Over time the core group, consisting of only about 20 people, developed a workable way of recording and presenting the findings. Each finding is a "fragment" whose name contains the page number of the plagiarized passage, followed by the corresponding lines in from-to notation.
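The fragment naming convention can be read mechanically. The following is only a minimal sketch: the article states that a name carries the page number followed by a from-to line range, but the exact format assumed here ("Fragment 016 01-15") and the helper names `parse_fragment` and `lines_by_page` are illustrative assumptions, not the wiki's actual tooling.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) name format: "Fragment <page> <first line>-<last line>".
FRAGMENT_RE = re.compile(r"^Fragment[ _](\d+)[ _](\d+)-(\d+)$")


def parse_fragment(name):
    """Split a fragment name into (page, first_line, last_line)."""
    match = FRAGMENT_RE.match(name)
    if match is None:
        raise ValueError(f"not a fragment name: {name!r}")
    page, first, last = (int(group) for group in match.groups())
    if first > last:
        raise ValueError(f"line range runs backwards: {name!r}")
    return page, first, last


def lines_by_page(names):
    """Collect the flagged line numbers per page; overlapping fragments merge."""
    pages = defaultdict(set)
    for name in names:
        page, first, last = parse_fragment(name)
        pages[page].update(range(first, last + 1))
    return dict(pages)
```

Storing the lines as sets means that two fragments covering the same lines of a page are counted only once, which matters when many contributors report overlapping findings.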
What looks like a bar code shows the plagiarism findings page by page: the emblem of the plagiarism hunters, the "barcode", indicates which pages contain plagiarism.

The barcode was generated semi-automatically from the list of fragments using a visualization language. The core group had decided against fully automatic generation because it feared that false plagiarism reports in the wiki could discredit the project. Before a barcode was published, there was always a plausibility check. The sharpness of the conflict was obvious even to outsiders; the cases of text vandalism, verbal abuse and anonymous threats have become well known.

The "cloud" as an alternative

That Guttenberg's work consists in large parts of plagiarism is now undisputed; details of the GuttenPlag evaluation follow later. Against the background of these events, the idea of having all academic work submitted digitally and checked by software in order to prevent plagiarism suggests itself.

So we submitted the Guttenberg dissertation to the five best plagiarism detection systems. All the participating vendors (PlagAware, Turnitin/iThenticate, Ephorus, PlagScan and Urkund) work "in the cloud": you do not know on which server in which country the highly sensitive work will land, nor who else has access to it. For some of the systems we tested in 2010, the terms of use even state that one grants the respective company all rights to the texts, as with Turnitin: "With regard to papers submitted to the Site, You hereby grant iParadigms a non-exclusive, royalty-free, perpetual, world-wide, irrevocable license to reproduce, transmit, display, disclose, archive and otherwise use in connection with the Services any paper You submit to the site."

But back to KTzG's work. In its digital version it is a professionally typeset PDF that uses ligatures and, because of the justified setting, contains narrow spaces.
It is 7.3 MB in size and comprises, including appendices and so on, 475 pages with about 190,000 words; the exact number depends on how expressions such as "§ 6, paragraph 1" are counted.

Because of the size of the file, there were some problems when uploading. PlagAware broke off on page 159 because too many matches had already been found. For iThenticate the work had to be split into 13 parts of about 15,000 words each, and each part was tested individually. Urkund at first seemed to have crashed altogether. Correspondence with technical support revealed that, mainly because of the similarities with the GuttenPlagWiki, the internal limit of 1000 hits had been reached and the report therefore could not be displayed. The technicians fixed this in a weekend shift and made a report possible after all. PlagScan had no problem with the file size, but we had to wait a night until a result was available.
The crowd result: the 20 sources most frequently used according to GuttenPlag (Fig. 2).

The GuttenPlagWiki results were used to assess the results of the systems. This raised two questions that are typical of information retrieval:

1. How many of the most plagiarized sources (see Figure 2) were found (yield, i.e. recall)?
2. How many of the reported findings are actually the original sources (accuracy, i.e. precision)?

At the beginning GuttenPlag even listed 151 sources, but during the tests some texts were deleted from the list because they had been found in other works. Some of these works were "plagiarism cousins" that had taken material from the same sources as Guttenberg without attribution; others had quoted correctly. We did find at least one source that had not been recorded in GuttenPlagWiki (Pernice) and added it there afterwards.

Before going further, two terms commonly used in the plagiarism discussion should be explained.

A pawn sacrifice is, following Lahusen, the correct citation and referencing of a small part of a literal takeover, which then continues without quotation marks. This creates the impression that the subsequent text is by the author.

A stricter pawn sacrifice lists the source in the bibliography without marking the exact passage.

Structural plagiarism is the term for taking over a line of thought, a chain of argument, or the footnotes in the same order from another work. Colloquially this is not regarded as plagiarism ("The source was named" or "It is worded quite differently"), but in scholarship it very much is.

Coverage of the top sources

Surprisingly, not all of the frequently used originals are searchable for the systems. Seven of the top 20 sources come from the Research Service of the German Bundestag and are not published; four are books that can be found on Google Books. One source, Vile 1991, had been translated by machine, making it untraceable. Altogether, eight of the top 20 sources could in principle be found online; seven of them were found by at least three of the five systems (Table 1).
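The two questions posed in this section (yield and accuracy) correspond to what information retrieval calls recall and precision. A minimal sketch of both measures follows; the source labels are made up for illustration and are not taken from the GuttenPlag list.

```python
def recall(reported, relevant):
    """Share of the relevant (actually plagiarized) sources that were reported."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined without relevant sources")
    return len(set(reported) & relevant) / len(relevant)


def precision(reported, relevant):
    """Share of the reported sources that are actually relevant."""
    reported = set(reported)
    if not reported:
        raise ValueError("precision is undefined without reported sources")
    return len(reported & set(relevant)) / len(reported)


# Made-up example: four verified sources, three reports, two of them correct.
verified = {"source A", "source B", "source C", "source D"}
reports = {"source A", "source B", "news article X"}

print(recall(reports, verified))     # 0.5
print(precision(reports, verified))  # two of three reports are correct
```

A system can score well on one measure and badly on the other: reporting very many candidate sources tends to raise recall while lowering precision.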
The cloud result: 13 of the 20 top sources were not found, but all sources available online were found by at least three systems (Table 1).

The main source, Volkmann-Schluck, was named by all systems except Ephorus, and indeed in first place. Only Schmitz 2001 (fourth place among the searchable sources), the FAZ article by Zehnpfennig (No. 9) and Nettesheim 2002 (No. 10) were found by all five systems. However, Zehnpfennig often appeared much further down the list; PlagScan, for example, reported this source only in 38th place. In some cases the systems also reported only small overlaps, but we assessed only "yes/no": reported or not. These figures should therefore be regarded as upper limits of what would be found with such a system.

The systems in detail

The GuttenPlag group found that around 94% of the pages and 63% of the lines contained plagiarism. The systems reported the percentages listed in the table under "Percentage weighting plagiarism", but nowhere is it explained what these percentages mean: are they a percentage of the sample or of the entire work?

In addition, the reported percentages and sources varied considerably from day to day, although we had not requested any new reports in the meantime.

Besides the evaluation itself, its presentation and the usability of the software are important. There is much room for improvement here. In none of the reports could one click on a reported plagiarized passage to see the page number. This would be necessary if a plagiarism case were to be heard by a doctoral committee. What is needed is an easily readable, clearly understandable, color-coded comparison of the plagiarized text and the original text, as the hand-made example in Figure 3 shows. To document a case of plagiarism for an examination board, the source would have to be documented with detailed information.

 
Now to the results in detail.

PlagAware: The side-by-side comparison, which we had liked so far with short documents, proved extremely problematic for this long work. Often only three or four words were marked, followed by "[...]" and again three or four words. First one had to scroll a long way to get past the preface and table of contents. A law professor would probably have judged the system broken at such a result and aborted the test. In addition, it is not clear which part of the work the passages come from, so we had to search the PDF by hand. Several reported sources belonged to Pastebin, a website for easily publishing text, or to GuttenPlagWiki. Such sources can, however, be excluded from the comparison. The report disappeared from the database for a while after a software update, which hopefully does not happen to paying customers. As with all the candidates, we had a free test account.

iThenticate irritated us because over 40% of the links to the sources returned an HTTP 404 error ("file not found"), although the sources were available, as a check by hand revealed. That check was laborious because text could not simply be copied from the report; we had to type out a phrase of three to five words and search for it again with Google. This was also necessary for many large PDF sources. In other words, iThenticate reported plagiarism, and we had to investigate further to find the copied passages. Occasionally, however, these passages were quoted correctly, which had escaped the system. It is to be feared that, given the flood of 404 errors, not only a law professor inexperienced with IT would have aborted the test. In the end, in six of the 13 parts into which the work had been split for iThenticate, the top plagiarism source was just such a 404.

Ephorus offers a pleasant side-by-side comparison, but for every source it includes a complete copy of the entire dissertation. We assume that the test broke off after about 10 sources because the report had already reached 54 MB.
It is simply unnecessary to repeat the parts that are not plagiarized. The copied passage and the page numbers are what is of interest.

PlagScan did not report much, but what it reported was actually plagiarism, provided one could find the passages. There is a fold-out list of sources for each passage. One has to open each source individually and then search within it to find the passage. The overall presentation uses floating windows, but there is no good side-by-side comparison. PlagAware reports one possible source and overlooks a major one.

Urkund proved extremely slow. According to the vendor, this is because the system cannot cope with the number of hits and alternative hits (mostly current media reporting on the case and citing a plagiarized passage). There were many navigation problems, and the reported figures are not comprehensible. In the meantime, the report can no longer be loaded.

Of the 131 sources verified by GuttenPlag at the time of testing, PlagAware found 7 (5%), iThenticate 30 (23%), Ephorus 6 (5%), PlagScan 19 (15%) and Urkund 16 (12%); all of them together found 38.

For the top 20 sources available online according to GuttenPlag, the yield looks like this: iThenticate found 16 of them (80%), Urkund 13 (65%), PlagScan 12 (60%), and PlagAware and Ephorus 6 each (30%).

Even though iThenticate found 80% of the most important sources, they were lost in the sheer volume of 1156 reported sources: a big precision problem. Of those, only slightly more than 400 were longer than 20 words at all. We examined only the 117 sources with 100 or more words. Just over 40% of these, however, could no longer be found on the Internet at the given URL, including six of the 13 "top sources" that were reported for each section.

Among these top sources there was one absolutely correct quotation (which also came out on top) and a match between Guttenberg's bibliography and the bibliography of another source.
At least three of them, Volkmann-Schluck and two from the second most frequently used source, were correct hits.

Conclusion

Could the law professors at the University of Bayreuth have detected the plagiarism in Guttenberg's dissertation using software? At the very least, the first and second examiners would have received an indication of some possibly unattributed sources and could have investigated further matches.

But all the systems studied suffer from deficiencies in the presentation of results and from problems in operation. Moreover, the reported values for the "amount" of plagiarism are misleading. Even though Ephorus reports only 5% (and many systems report "green, no plagiarism" for values below 10%), the passages found are serious plagiarism.

On the other hand, iThenticate reports 56% plagiarism for Part 5 of the dissertation, but eight of the 13 sources with at least 100 copied words cannot be found on the Internet at the URL given. In defense of the software it must be said that plagiarism is difficult to quantify. GuttenPlagWiki reports 94% of pages with plagiarism, which has occasionally been reported in the press as a plagiarism share of 94%. That is not correct; 63% of the lines would be a more accurate indication of the amount of plagiarism.

Our recommendation: if teachers have a suspicion and find nothing with Google, they should use one of the software systems, preferably two or three different ones. But they need assistance in interpreting the results, for example from trained staff at the university libraries. Routinely sending all work through a plagiarism detection system and then alerting the teachers above a simple "threshold" results in far too many false alarms. The proposal of the service provider Turnitin that students should be allowed to use the systems themselves should be treated with caution. It could give some of them the idea of using synonymization systems, which systematically replace the words found in one's work with words of similar or identical meaning.

Current plagiarism detection systems work best with shorter texts such as term papers.
For complex texts with many quotations and footnotes, they are unsuitable. In plagiarism detection, a sufficiently large crowd wins by far, particularly because it is much more precise and can also use offline sources.
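The gap between the two GuttenPlag figures discussed above (share of pages versus share of lines) comes from counting in different units. The sketch below illustrates this with a toy document; the data and the function names `page_share` and `line_share` are made up for illustration, and a fixed number of lines per page is assumed for simplicity.

```python
def page_share(flagged_lines_by_page, total_pages):
    """Share of pages that contain at least one flagged line."""
    pages_hit = sum(1 for lines in flagged_lines_by_page.values() if lines)
    return pages_hit / total_pages


def line_share(flagged_lines_by_page, lines_per_page, total_pages):
    """Share of all lines that are flagged (assumes equally long pages)."""
    flagged = sum(len(lines) for lines in flagged_lines_by_page.values())
    return flagged / (lines_per_page * total_pages)


# Toy document: 4 pages of 10 lines each, a few flagged lines on 3 pages.
flagged = {1: {1, 2}, 2: {5}, 3: {7, 8, 9}}

print(page_share(flagged, 4))      # 0.75
print(line_share(flagged, 10, 4))  # 0.15
```

A single flagged line is enough to count a whole page, so the page share is always at least as large as the line share and says little about how much text was actually copied.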
